Allow user submitted HTML in PHP - php

I want to allow a lot of user-submitted HTML in user profiles. I currently try to filter out what I don't want, but I now want to switch to a whitelist approach.
Here is my current non-whitelist approach:
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well; I have had no problems with it in two years of use. But for a whitelist approach, is there anything in PHP similar to Stack Overflow's C# method?
http://refactormycode.com/codes/333-sanitize-html

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
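A minimal sketch of how the library is typically called, assuming HTML Purifier is installed and autoloadable (for example through Composer). The function name cleanProfileHtml and the particular whitelist string are illustrative assumptions, not something from the question; the fallback branch simply escapes everything when the library is not available.

```php
<?php
// Hypothetical helper: filter profile HTML through HTML Purifier.
function cleanProfileHtml(string $dirty): string
{
    if (!class_exists('HTMLPurifier')) {
        // Library not loaded: escape everything rather than emit raw markup.
        return htmlspecialchars($dirty, ENT_QUOTES, 'UTF-8');
    }

    $config = HTMLPurifier_Config::createDefault();
    // Whitelist of elements and, in brackets, the attributes each may carry.
    // This particular list is an assumption chosen for illustration.
    $config->set('HTML.Allowed', 'b,i,u,strong,em,a[href],img[src|alt]');

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}
```

Building the config once and reusing the purifier object across calls is usually cheaper than recreating both for every profile.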

It may be safer to parse the input with DOMDocument, remove disallowed tags with removeChild(), and then serialize the result.
Filtering with regular expressions is not always safe, especially once the patterns get this complex. Attackers can find ways to cheat your filters; forums and social networks know that very well.
For instance, browsers ignore spaces after the <. Your regex filters <script, but if I use < script... big FAIL!
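The DOMDocument idea can be sketched roughly as follows. This is an illustration under stated assumptions, not a vetted sanitizer: the ALLOWED_TAGS table, the function names, and the rule for acceptable href values are all hypothetical choices made for the example.

```php
<?php
// Hypothetical whitelist: tag name => attributes allowed on that tag.
const ALLOWED_TAGS = [
    'b' => [], 'i' => [], 'u' => [], 'strong' => [], 'em' => [],
    'a' => ['href'],
];

function sanitizeNode(DOMNode $node): void
{
    // Copy the child list first: the live list changes while we edit it.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement) {
            sanitizeNode($child); // clean descendants before the element itself
            $tag = strtolower($child->tagName);
            if (!isset(ALLOWED_TAGS[$tag])) {
                if (!in_array($tag, ['script', 'style'], true)) {
                    // Unwrap: keep the (already cleaned) children, drop the tag.
                    while ($child->firstChild !== null) {
                        $node->insertBefore($child->firstChild, $child);
                    }
                }
                // script/style are removed together with their content.
                $node->removeChild($child);
                continue;
            }
            // Drop every attribute that is not whitelisted for this tag.
            foreach (iterator_to_array($child->attributes) as $attr) {
                if (!in_array(strtolower($attr->name), ALLOWED_TAGS[$tag], true)) {
                    $child->removeAttributeNode($attr);
                }
            }
            // Assumption: only http(s), mailto, root-relative and fragment links.
            if ($child->hasAttribute('href')
                && !preg_match('~^(?:https?://|mailto:|/|#)~i', trim($child->getAttribute('href')))) {
                $child->removeAttribute('href');
            }
        } elseif (!($child instanceof DOMText)) {
            $node->removeChild($child); // comments, processing instructions
        }
    }
}

function sanitizeHtml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on tag soup
    // The XML declaration is a common trick to make libxml read UTF-8;
    // the wrapping div keeps leading text from being placed oddly.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    sanitizeNode($body);

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the parser decides what counts as a tag, tricks that target a regex (odd spacing, mixed case, broken nesting) are judged the same way a browser-like parser would see them, which is the point of the suggestion above.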

HTML Purifier is the best HTML parser/cleaner out there.

For those of you suggesting simply using strip_tags... be aware: strip_tags will NOT strip out tag attributes, and broken tags will also mess it up.
From the manual page:
Warning: Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.
Warning: This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
You CANNOT rely on just this one solution.

You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that strip_tags cannot filter out the attributes of the tags you allow, e.g.
link
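To make the attribute problem concrete, here is a small sketch. The input string is a made-up example, not the one from the original post.

```php
<?php
// Hypothetical input: an allowed tag carrying an event-handler attribute,
// followed by a tag that is not on the allowed list.
$html = '<a href="#" onclick="alert(1)">link</a><script>alert(2)</script>';

$result = strip_tags($html, '<b><a><i><u><span>');

// The <a> tag is kept exactly as submitted, onclick included.
// The <script> tags are removed, but the text between them stays.
// $result is now: <a href="#" onclick="alert(1)">link</a>alert(2)
```

So strip_tags() narrows the set of tags, but everything inside an allowed tag's angle brackets passes through untouched.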

Try the function "getCleanHTML" below. It extracts the text content of every element, except that elements whose tag name is on the whitelist are kept as markup. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
// Serialize a single node (with its subtree) as HTML.
function getHTMLCode($Node) {
    $Document = new DOMDocument();
    $Document->appendChild($Document->importNode($Node, true));
    return $Document->saveHTML();
}
// Keep whitelisted elements as markup; reduce everything else to its text.
// Note: a whitelisted element is copied whole, attributes and nested tags included.
function getCleanHTML($Node, $Text = "") {
    global $TagWhiteList;
    // Text, comment and other non-element nodes have no tag name.
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;
    if (in_array($Node->tagName, $TagWhiteList))
        return $Text.getHTMLCode($Node);
    // Walk the children; the loop also copes with an element that has none.
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getCleanHTML($Child, $Text);
    return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.

It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
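function sanitize($html) {
    $whitelist = array(
        'b', 'i', 'u', 'strong', 'em', 'a'
    );
    // Remove every tag (opening or closing) whose name is not on the whitelist.
    // Attributes on the allowed tags are left untouched.
    return preg_replace("/<(?!\/?(?:".implode("|", $whitelist).")\b)[^>]*>/i", "", $html);
}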
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie

Related

PHP strip_tags not allowing less than '<' in string

Please let me know how to allow the less-than character '<' through strip_tags().
Code snippet:
$string = "abc<123";
echo StringFromUser($string);
function StringFromUser($string)
{
    if (is_string($string))
    {
        return strip_tags($string);
    }
}
Output: abc
Expected output: abc<123
Encode it properly in the first place.
$string = "abc&lt;123";
Although if you're not sanitizing for HTML output you shouldn't be using strip_tags() anyway.
strip_tags is a pretty basic and not very good way to sanitize data (i.e. "punch arbitrary values into shape"), as you are seeing. You should only sanitize data if you have a very good reason to; oftentimes there is no good reason. Ask yourself what you are gaining from arbitrarily stripping out parts of a value.
You either want to validate or escape to avoid syntax problems and/or injection attacks. Sanitization is rarely the right thing to do. Read The Great Escapism (Or: What You Need To Know To Work With Text Within Text) for more background on the whole topic.
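As a rough illustration of escaping instead of sanitizing: the value is stored as the user typed it and encoded only at the point where it is written into HTML. The helper name escapeForHtml is a hypothetical choice for this sketch.

```php
<?php
// Hypothetical helper: encode a value for safe insertion into HTML text
// or a quoted attribute. Nothing is removed from the user's input.
function escapeForHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$string = "abc<123";
$safe = escapeForHtml($string); // "abc&lt;123", which a browser shows as abc<123
```

With this approach the "<" survives intact for the reader, and markup typed by the user is displayed as text rather than interpreted.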
You could search for the character in your string, take it out, run strip_tags() on the string and put the character back in:
$string = "abc<123";
$character = "<";
$pos = strpos($string, $character);
$tag = ">";
$check = strpos($string, $tag);
// Only a "<" with no ">" anywhere in the string is treated as a literal character.
if ($pos !== false && $check === false) {
    // Take the first "<" out, strip tags, then put it back at the same offset.
    $string = substr($string, 0, $pos) . substr($string, $pos + 1);
    $string = strip_tags($string);
    $string = substr($string, 0, $pos) . "<" . substr($string, $pos);
} else {
    $string = strip_tags($string);
}
or you could use preg_replace() to replace all the characters you don't want to have in your $string.
The problem:
The purpose of using strip_tags is to prevent HTML or PHP injection attacks. However, strip_tags not only removes HTML and PHP tags, it also removes part of a math expression that contains a < operator. So "abc<123" ends up as "abc".
The solution:
What we know is that a < followed by a space is not treated as an HTML or PHP tag by strip_tags. So I replace "abc<123" with "abc< myUniqueID123". Note the space after the < sign, and that only a < followed by a digit is replaced. Next, run strip_tags on the string. Finally, change "abc< myUniqueID123" back to "abc<123".
$string = "abc<123";
echo StringFromUser($string);
function StringFromUser($string)
{
    if (is_string($string)) {
        // Change "abc<123" to "abc< myUniqueID123" so math expressions are not stripped.
        // myUniqueID marks what we changed so it can be undone afterwards.
        $string = preg_replace("/(<)(\d)/", "$1 myUniqueID$2", $string);
        $string = strip_tags($string);
        // Change "abc< myUniqueID123" back to "abc<123".
        $string = preg_replace("/(<) myUniqueID(\d)/", "$1$2", $string);
        return $string;
    }
}

PHP Regular expression: exclude href anchor tags

I'm creating a simple search for my application.
I'm using PHP regular expression replacement (preg_replace) to look for a search term (case insensitive) and add <strong> tags around the search term.
preg_replace('/'.$query.'/i', '<strong>$0</strong>', $content);
Now, I'm not the greatest with regular expressions. What would I add to the regular expression so that it does not replace search terms inside the href of an anchor tag?
That way, if someone searched for "info", it wouldn't change a link to "http://something.com/this_<strong>info</strong>/index.html"
I believe you will need conditional subpatterns for this purpose:
$query = "link";
$query = preg_quote($query, '/');
$p = '/((<)(?(2)[^>]*>)(?:.*?))*?(' . $query . ')/smi';
$r = "$1<strong>$3</strong>";
$str = '<a href="/Link/foo/the_link.htm">'."\n".'A Link</a>'; // multi-line text
$nstr = preg_replace($p, $r, $str);
var_dump( $nstr );
$str = 'Its not a Link'; // non-link text
$nstr = preg_replace($p, $r, $str);
var_dump( $nstr );
Output: (view source)
string(61) "<a href="/Link/foo/the_link.htm">
A <strong>Link</strong></a>"
string(31) "Its not a <strong>Link</strong>"
PS: The regex above also handles multi-line input and, more importantly, it skips matches not only inside href but inside any HTML tag, i.e. anything enclosed in < and >.
EDIT: If you just want to exclude hrefs and not all HTML tags, use this pattern instead of the one above:
$p = '/((<)(?(2).*?href=[^>]*>)(?:.*?))*?(' . $query . ')/smi';
I'm not 100% sure what you are ultimately after here, but from what I can tell, it's a sort of "search phrase" highlighting facility that highlights keywords. If so, I suggest having a look at the Text Helper in CodeIgniter. It provides a nice little function called highlight_phrase, which could do what you are looking for.
The function is as follows.
function highlight_phrase($str, $phrase, $tag_open = '<strong>', $tag_close = '</strong>')
{
if ($str == '')
{
return '';
}
if ($phrase != '')
{
return preg_replace('/('.preg_quote($phrase, '/').')/i', $tag_open."\\1".$tag_close, $str);
}
return $str;
}
You may use conditional subpatterns, see explanation here: http://cz.php.net/manual/en/regexp.reference.conditional.php
preg_replace("/(?(?<=href=\")([^\"]*\")|($query))/i","\\1<strong>\\2</strong>",$x);
In your case, if you have whole HTML and not just href="", there is an easier solution using the 'e' modifier, which lets you run PHP code to build the replacement for each match:
function termReplacer($found) {
$found = stripslashes($found);
if(substr($found,0,5)=="href=") return $found;
return "<strong>$found</strong>";
}
echo preg_replace("/(?:href=)?\S*$query/e","termReplacer('\\0')",$x);
See example #4 here http://cz.php.net/manual/en/function.preg-replace.php
If your expression is even more complex, you can use regular expressions inside termReplacer() as well.
There is a minor quirk in PHP: the $found parameter in termReplacer() needs to be passed through stripslashes()!
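The 'e' modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so on current versions the same goal needs a different route. One possible sketch, which swaps in a different technique from the answers above: split the content into tags and the text between them, and highlight only the text. The function name highlightOutsideTags is hypothetical, and the simple tag pattern assumes attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper: wrap matches of $query in <strong>, but only in the
// text between tags, never inside a tag such as <a href="...">.
function highlightOutsideTags(string $content, string $query): string
{
    if ($query === '') {
        return $content;
    }
    $pattern = '/' . preg_quote($query, '/') . '/i';

    // Split on tags and keep them: even indexes hold text, odd indexes tags.
    $parts = preg_split('/(<[^>]*>)/', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($pattern, '<strong>$0</strong>', $part);
        }
    }
    return implode('', $parts);
}
```

For the example in the question, searching for "info" would leave the href untouched and highlight only the visible link text.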

preg_replace apply string function (like urlencode) in replacement

I want to rewrite all links in an HTML document string in PHP like this: replace href='LINK' with href='MY_DOMAIN?URL=LINK'. Because LINK becomes a URL parameter, it must be URL-encoded. I'm trying this:
preg_replace('/href="(.+)"/', 'href="http://'.$host.'/?url='.urlencode('${1}').'"', $html);
but '${1}' is just a string literal when urlencode() runs, not the URL found by the pattern. What do I need to do to make this code work?
Well, to answer your question, you have two choices with Regex.
You can use the e modifier to the regex, which tells preg_replace that the replacement is php code and should be executed. This is typically seen as not great, since it's really no better than eval...
preg_replace($regex, "'href=\"http://{$host}?url='.urlencode('\\1').'\"'", $html);
The other option (which is better IMHO) is to use preg_replace_callback:
$callback = function ($match) use ($host) {
return 'href="http://'.$host.'?url='.urlencode($match[1]).'"';
};
preg_replace_callback($regex, $callback, $html);
But also never forget, don't parse HTML with regex...
So in practice, the better way of doing it (The more robust way), would be:
$dom = new DomDocument();
$dom->loadHtml($html);
$aTags = $dom->getElementsByTagName('a');
foreach ($aTags as $aElement) {
$href = $aElement->getAttribute('href');
$href = 'http://'.$host.'?url='.urlencode($href);
$aElement->setAttribute('href', $href);
}
$html = $dom->saveHtml();
Use the 'e' modifier.
preg_replace('/href="([^"]+)"/e',"'href=\"http://'.$host.'?url='.urlencode('\\1').'\"'",$html);
http://uk.php.net/preg-replace - example #4

Multiple regular expression interfere

I use regex to create HTML tags in plain text, like this:
loop
$SearchArray[] = "/\b(".preg_quote($user['name'], "/").")\b/i";
$ReplaceArray[] = '$1';
-
$str = preg_replace($SearchArray, $ReplaceArray, $str);
I'm looking for a way to not match $user['name'] in a tag.
You could use preg_replace_callback()
for 5.3+:
$callback = function($match) use ($user) {
    return ''.$match[1].'';
};
$regex = "/\b(".preg_quote($user['name'], "/").")\b/i";
$str = preg_replace_callback($regex, $callback, $str);
for 5.2+:
$method = 'return \'\'.$match[1].\'\';';
$callback = create_function('$match', $method);
$regex = "/\b(".preg_quote($user['name'], "/").")\b/i";
$str = preg_replace_callback($regex, $callback, $str);
So the problem is that you're making several passes over the document, replacing a different user name in each pass, and you're afraid you'll unintentionally replace a name inside a tag that was created in a previous pass, right?
I would try to do all of the replacements in one pass, using preg_replace_callback as @ircmaxwell suggested, and one regex that can match any legal user name. In the callback function, you look up the matched string to see if it's a real user's name. If it is, return the generated link; if not, return the matched string for reinsertion.
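The single-pass idea might look something like the sketch below. It makes several assumptions not stated in the question: that a legal user name is a run of word characters, that each user array has 'name' and 'url' keys, and that the generated tag is a plain anchor. The function name linkUserNames is hypothetical.

```php
<?php
// Hypothetical helper: link every known user name in one pass over the text.
function linkUserNames(string $text, array $users): string
{
    // Index users by lower-cased name for a case-insensitive lookup.
    $byName = [];
    foreach ($users as $user) {
        $byName[strtolower($user['name'])] = $user;
    }

    // Assumption: a legal user name is a run of word characters.
    return preg_replace_callback('/\b\w+\b/', function (array $match) use ($byName) {
        $key = strtolower($match[0]);
        if (!isset($byName[$key])) {
            return $match[0]; // not a user: put the word back unchanged
        }
        $url = htmlspecialchars($byName[$key]['url'], ENT_QUOTES, 'UTF-8');
        return '<a href="' . $url . '">' . $match[0] . '</a>';
    }, $text);
}
```

Because the text is scanned only once, markup produced for one user is never re-examined, so a user named "a" or "href" cannot be matched inside a link generated for someone else.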
It looks like you're trying to add a bunch of anchors to a document. Have you thought of using SimpleXML? This assumes that the anchor tags are part of a larger XHTML document.
//$xhtml_doc is the path to some XHTML document
$doc = simplexml_load_file($xhtml_doc);
//NOTE: find the parent element for all these anchors (maybe with xpath)
//example: $parent = $doc->xpath('//div[@id="parent"]');
foreach($user as $k => $v){
$anchor = $doc->addChild('a', $v['name']);
$anchor->addAttribute('href', $v['url']);
}
return $doc->asXML();
SimpleXML helps me a lot in these situations. It'll be a lot faster than regex, even if this isn't exactly what you want to do.
