Should I be using htmlspecialchars? - php

I seem to have trouble understanding when to use htmlspecialchars().
Let's say I do the following when I am inserting data:
$_POST = filter_input_array(INPUT_POST, [
'name' => FILTER_SANITIZE_STRING,
'homepage' => FILTER_DEFAULT // do nothing
]);
$course = new Course();
$course->name = trim($_POST['name']);
$course->homepage = $_POST['homepage']; // may contain unsafe HTML
$courseDAO = DAOFactory::getCourseDAO();
$courseDAO->addCourse($course); // simple insert statement
When I output, I do the following:
$courseDAO = DAOFactory::getCourseDAO();
$course = $courseDAO->getCourseById($_GET['id']);
?>
<?php ob_start() ?>
<h1><?= $course->name ?></h1>
<div class="homepage"><?= $course->homepage ?></div>
<?php $content = ob_get_clean() ?>
<?php include 'layout.php' ?>
I would like that $course->homepage be treated and rendered as HTML by the browser.
I've been reading answers on this question. Should I be using htmlspecialchars() anywhere here?

There are (from a security POV) three types of data that you might output into HTML:
Text
Trusted HTML
Untrusted HTML
(Note that HTML attributes and certain elements are special cases, e.g. onclick attributes expect HTML encoded JavaScript so your data needs to be HTML safe and JS safe).
If it is text, then use htmlspecialchars to convert it to HTML.
If it is trusted HTML, then just output it.
If it is untrusted HTML then you need to sanitise it to make it safe. That generally means parsing it with a DOM parser, and then removing all elements and attributes that do not appear on a whitelist as safe (some attributes may be special cased to be filtered rather than stripped), and then converting the DOM back to HTML. Tools like HTML Purifier exist to do this.
$course->homepage = $_POST['homepage']; // may contain unsafe HTML
I would like that $course->homepage be treated and rendered as HTML by the browser.
Then you have the third case and need to filter the HTML.
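A minimal sketch of how the three cases might look for the template in the question. The names `e_text()` and `render_course()` and the `$sanitizeHtml` callable are hypothetical, introduced only for illustration; in practice the callable would wrap a whitelist-based sanitiser such as HTML Purifier.

```php
<?php
// Hypothetical helper: escape plain text for output into HTML.
function e_text(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: $sanitizeHtml stands in for a whitelist-based
// sanitiser (for example a thin wrapper around HTML Purifier).
function render_course(string $name, string $homepage, callable $sanitizeHtml): string
{
    // The name is text, so it is escaped; the homepage is untrusted HTML,
    // so it goes through the sanitiser instead of being escaped.
    return '<h1>' . e_text($name) . "</h1>\n"
        . '<div class="homepage">' . $sanitizeHtml($homepage) . '</div>';
}
```

The sanitiser is passed in rather than hard-coded so that the escaping of text and the filtering of HTML stay visibly separate.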

It looks like you're storing raw html in the database and then rendering it to the page later.
I wouldn't filter the data before storing it in the database: you risk corrupting the user's input, and there would be no way to retrieve the original if it was never stored.
If you want the output to be treated as HTML by the browser, then no, htmlspecialchars is not the solution.
However, it is worth thinking about using strip_tags to remove script tags in order to combat XSS. With strip_tags you have to whitelist the allowable tags, which is tedious. Be aware that strip_tags leaves the attributes of allowed tags untouched, so event-handler attributes such as onclick survive on any tag you allow; on its own it is not a complete defence.
It might also be worth taking a look at TinyMCE to see how it deals with such things.
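A small sketch of what the strip_tags suggestion above does with a hostile snippet. The input string is made up for illustration; the point is only that attributes on an allowed tag are not touched.

```php
<?php
// Untrusted input with an event-handler attribute and a script element.
$input = '<b onmouseover="alert(1)">hi</b><script>alert(2)</script>';

// Allow only <b>. strip_tags() removes the <script> tags themselves,
// but it does not remove attributes from the allowed <b> element.
$filtered = strip_tags($input, '<b>');

echo $filtered, "\n";
```

Because the onmouseover attribute is still present in the result, a tag whitelist alone would not stop this payload.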

Output plain HTML if you are sure about the contents. Use htmlspecialchars on everything else, especially user input, to prevent security issues.

Related

XSS vulnerabilities still exist even after using HTML Purifier

I'm testing one of my web applications using Acunetix. To protect this project against XSS attacks, I used HTML Purifier. This library is recommended by most PHP developers for this purpose, but my scan results show that HTML Purifier cannot protect us from XSS attacks completely. The scanner found two ways of attack by sending different harmful inputs:
1<img sRc='http://attacker-9437/log.php? (See HTML Purifier result here)
1"onmouseover=vVF3(9185)" (See HTML Purifier result here)
As you can see from the results, HTML Purifier could not detect such attacks. I don't know whether there is a specific HTML Purifier option to solve such problems, or whether it is really unable to detect these methods of XSS attack.
Do you have any idea? Or any other solution?
(This is a late answer since this question is becoming the place duplicate questions are linked to, and previously some vital information was only available in comments.)
HTML Purifier is a contextual HTML sanitiser, which is why it seems to be failing on those tasks.
Let's look at why in some detail:
1<img sRc='http://attacker-9437/log.php?
You'll notice that HTML Purifier closed this tag for you, leaving only an image injection. An image is a perfectly valid and safe tag (barring, of course, current image library exploits). If you want it to throw away images entirely, consider adjusting the HTML Purifier whitelist by setting HTML.Allowed.
That the image from the example is now loading a URL that belongs to an attacker, thus giving the attacker the IP of the user loading the page (and nothing else), is a tricky problem that HTML Purifier wasn't designed to solve. That said, you could write a HTML Purifier attribute checker that runs after purification, but before the HTML is put back together, like this:
// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$image = $htmlDef->addBlankElement('img');
// HTMLPurifier_AttrTransform_CheckURL is a custom class you've supplied,
// and checks the URL against a white- or blacklist:
$image->attr_transform_post[] = new HTMLPurifier_AttrTransform_CheckURL();
The HTMLPurifier_AttrTransform_CheckURL class would need to have a structure like this:
class HTMLPurifier_AttrTransform_CheckURL extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        // Nothing to check if the element has no src attribute
        if (!isset($attr['src'])) {
            return $attr;
        }
        $destination = $attr['src'];
        if (is_malicious($destination)) {
            // ^ is_malicious() is something you'd have to write
            $this->confiscateAttr($attr, 'src');
        }
        return $attr;
    }
}
Of course, it's difficult to do this 'right':
if this is a live check with some web-service, this will slow purification down to a crawl
if you're keeping a local cache you run risk of having outdated information
if you're using heuristics ("that URL looks like it might be malicious based on indicators x, y and z"), you run risk of missing whole classes of malicious URLs
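One way the is_malicious() placeholder could be filled in is a local allowlist of hosts. This is only a sketch under that assumption: the `$allowedHosts` parameter and the `example.com` default are hypothetical, and an allowlist inherits the stale-data trade-off described above.

```php
<?php
// Hypothetical implementation of the is_malicious() placeholder:
// treat every URL whose host is not on a local allowlist as malicious.
function is_malicious(string $url, array $allowedHosts = ['example.com']): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        // Unparsable or host-less URLs are rejected.
        return true;
    }
    return !in_array(strtolower($host), $allowedHosts, true);
}
```

Rejecting everything that is not explicitly allowed is the conservative choice here; a blacklist would have the opposite default.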
1"onmouseover=vVF3(9185)"
HTML Purifier assumes the context your HTML is set in is a <div> (unless you tell it otherwise by setting HTML.Parent).
If you just feed it an attribute value, it's going to assume you're going to output this somewhere so the end-result looks like this:
...
<div>1"onmouseover=vVF3(9185)"</div>
...
That's why it appears to not be doing anything about this input - it's harmless in this context. You might even not want to strip this information in that context. I mean, we're talking about this snippet here on stackoverflow, and that's valuable (and not causing a security problem).
Context matters. Now, if you instead feed HTML Purifier this snippet:
<div class="1"onmouseover=vVF3(9185)"">foo</div>
...suddenly you can see what it's made to do:
<div class="1">foo</div>
Now it's removed the injection, because in this context, it would have been malicious.
What to use HTML Purifier for and what not
So now you're left to wonder what you should be using HTML Purifier for, and when it's the wrong tool for the job. Here's a quick run-down:
you should use htmlspecialchars($input, ENT_QUOTES, 'utf-8') (or whatever your encoding is) if you're outputting into a HTML document and aren't interested in preserving HTML at all - HTML Purifier would be unnecessary overhead there, and it'll let some things through
you should use HTML Purifier if you want to output into a HTML document and allow formatting, e.g. if you're a message board and you want people to be able to format their messages using HTML
you should use htmlspecialchars($input, ENT_QUOTES, 'utf-8') if you're outputting into a HTML attribute (HTML Purifier is not meant for this use-case)
You can find some more information about sanitising / escaping by context in this question / answer.
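To make the attribute case from the run-down concrete, here is a short sketch that takes the scanner payload from the question and escapes it with htmlspecialchars. The surrounding markup is made up for illustration.

```php
<?php
// Payload reported by the scanner in the question.
$input = '1"onmouseover=vVF3(9185)"';

// Escape once, at output time, for the HTML context.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Element content: the quotes are harmless here either way.
echo '<div>' . $escaped . "</div>\n";

// Attribute value: the encoded quotes can no longer close the attribute.
echo '<div class="' . $escaped . '">foo</div>' . "\n";
```

ENT_QUOTES matters for the attribute case because it encodes both double and single quotes, whichever one delimits the attribute.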
All the HTML purifier seems to be doing, from the brief look that I gave, was HTML encode certain characters such as <, > and so on. However there are other means of invoking JS without using the normal HTML characters:
javascript:prompt(1) // In image tags
src="http://evil.com/xss.html" // In iFrame tags
Please review comments (by @pinkgothic) below.
Points below:
This would be HTML injection which does effectively lead to XSS. In this case, you open an <img> tag, point the src to some non-existent file which in turn raises an error. That can then be handled by the onerror handler to run some JavaScript code. Take the following example:
<img src=x onerror=alert(document.domain)>
The entry point for this is generally accompanied by prematurely closing another tag on an input. For example (URL decoded for clarity):
GET /products.php?type="><img src=x onerror=prompt(1)> HTTP/1.1
This, however, is easily mitigated by HTML escaping meta-characters (i.e. <, >).
Same as above, except this could be closing off an HTML attribute instead of a tag and inserting its own attribute. Say you have a page where you can upload the URL for an image:
<img src="$USER_DEFINED">
A normal example would be:
<img src="http://example.com/img.jpg">
However, inserting the above payload, we cut off the src attribute which points to a non-existent file and inject an onerror handler:
<img src="1"onerror=alert(document.domain)">
This executes the same payload mentioned above.
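A hedged sketch of how the user-supplied image URL might be handled. `img_tag()` is a hypothetical helper, not part of any library: it escapes the value for the attribute context and also restricts the URL scheme, since escaping alone does nothing about a `javascript:` URL.

```php
<?php
// Hypothetical helper: build an <img> tag from a user-supplied URL.
function img_tag(string $url): string
{
    // Only allow http and https; this rejects javascript: and similar schemes.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return '';
    }
    // Encode quotes so the value cannot break out of the src attribute.
    return '<img src="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">';
}

echo img_tag('http://example.com/img.jpg'), "\n";
```

Returning an empty string for rejected URLs is just one option; an application might prefer a placeholder image or a validation error.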
Remediation
This is heavily documented and tested in multiple places, so I won't go into detail. However, the following two articles are great on the subject and will cover all your needs:
https://www.acunetix.com/websitesecurity/cross-site-scripting/
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet

PHP XSS sanitization

Questions:
What are the best safe1(), safe2(), safe3(), and safe4() functions to avoid XSS for UTF8 encoded pages? Is it also safe in all browsers (specifically IE6)?
<body><?php echo safe1($xss)?></body>
<body id="<?php echo safe2($xss)?>"></body>
<script type="text/javascript">
var a = "<?php echo safe3($xss)?>";
</script>
<style type="text/css">
.myclass {width:<?php echo safe4($xss)?>}
</style>
.
Many people say the absolute best that can be done is:
// safe1 & safe2
$s = htmlentities($s, ENT_QUOTES, "UTF-8");
// But how would you compare the above to:
// https://github.com/shadowhand/purifier
// OR http://kohanaframework.org/3.0/guide/api/Security#xss_clean
// OR is there an even better if not perfect solution?
.
// safe3
$s = mb_convert_encoding($s, "UTF-8", "UTF-8");
$s = htmlentities($s, ENT_QUOTES, "UTF-8");
// How would you compare this to using mysql_real_escape_string($s)?
// (Yes, I know this is a DB function)
// Some other people also recommend calling json_encode() before passing to htmlentities
// What's the best solution?
.
There are a hell of a lot of posts about PHP and XSS.
Most just say "use HTMLPurifier" or "use htmlspecialchars", or are wrong.
Others say use OWASP -- but it is EXTREMELY slow.
Some of the good posts I came across are listed below:
Do htmlspecialchars and mysql_real_escape_string keep my PHP code safe from injection?
XSS Me Warnings - real XSS issues?
CodeIgniter - why use xss_clean
safe2() is clearly htmlspecialchars()
In place of safe1() you should really be using HTMLPurifier to sanitize complete blobs of HTML. It strips unwanted attributes, tags and in particular anything javascriptish. Yes, it's slow, but it covers all the small edge cases (even for older IE versions) which allow for safe HTML user snippet reuse. But check out http://htmlpurifier.org/comparison for alternatives. -- If you really only want to display raw user text there (no filtered html), then htmlspecialchars(strip_tags($src)) would actually work fine.
safe3() screams regular expression. Here you can really only apply a whitelist to whatever you actually want:
var a = "<?php echo preg_replace('/[^-\w\d .,]/', "", $xss)?>";
You can of course use json_encode here to get a perfectly valid JS syntax and variable. But then you've just delayed the exploitability of that string into your JS code, where you then have to babysit it.
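A minimal sketch of the json_encode route mentioned above, assuming PHP 7.3 or later for JSON_THROW_ON_ERROR. `js_string()` is a hypothetical helper name; the JSON_HEX_* flags are standard json_encode options.

```php
<?php
// Hypothetical helper: encode a PHP string as a JavaScript string literal
// that can be printed inside a <script> element.
function js_string(string $value): string
{
    // The JSON_HEX_* flags turn <, >, &, ' and " inside the value into
    // \uXXXX escapes, so the value cannot close the <script> element.
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

$xss = '</script><script>alert("x")</script>';
echo '<script>var a = ' . js_string($xss) . ";</script>\n";
```

Note that json_encode already emits the surrounding double quotes, so the template must not add its own. The value is then a plain string in JavaScript, and any later use of it (for example assigning it to innerHTML) still needs its own care.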
Is it also safe in all browsers (specifically IE6)?
If you specify the charset explicitly, then IE won't do its awful content detection magic, so UTF7 exploits can be ignored.
http://php.net/htmlentities note the section on the optional third parameter that takes a character encoding. You should use this instead of mb_convert_encoding. So long as the php file itself is saved with a utf8 encoding that should work.
htmlentities($s, ENT_COMPAT, 'UTF-8');
As for injecting the variable directly into javascript, you might consider putting the content into a hidden html element somewhere else in the page instead and pulling the content out of the dom when you need it.
The purifiers that you mention are used when you want to actually display html that a user submitted (as in, allow the browser to actually render). Using htmlentities will encode everything such that the characters will be displayed in the ui, but none of the actual code will be interpreted by the browser. Which are you aiming to do?

A PHP Function that verify code language

I have a form with 2 textareas; the first one lets the user send HTML code, the second lets them send CSS code. I have to verify, with a PHP function, that the language is correct.
If the language is correct, then for security I have to check that there is no PHP code, SQL injection or whatever.
What do you think ? Is there a way to do that ?
Where can I find this kind of function ?
Is "HTML Purifier" http://htmlpurifier.org/ a good solution ?
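If you have to make the data safe to insert into the database, then you just have to use the mysql_real_escape_string() function before inserting it, and put the escaped value inside quotes in the SQL string. (The mysql_* extension was removed in PHP 7.0; with mysqli or PDO, use prepared statements instead.)
//Safe database insertion (the quotes around the value are required)
mysql_query("INSERT INTO table(column) VALUES('" . mysql_real_escape_string($_POST['field']) . "')");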
If you want to output the data to the end user as plain text, then you have to escape all HTML-sensitive chars with htmlspecialchars(). If you want to output it as HTML, then you have to use a tool such as HTML Purifier.
//Safe plain text output
echo htmlspecialchars($data, ENT_QUOTES);
//Safe HTML output
$data = purifyHtml($data); //Or however it is specified in the purifier documentation
echo $data; //Safe html output
For something primitive you can use a regex, but note that using a parser to cover all possibilities is recommended.
/(<\?(?:php)?(.*)\?>)/i
Example: http://regexr.com?2t3e5 (change the &lt; in the expression back to a < and it will work; for some reason regexr changes it to HTML formatting)
EDIT
/(<\?(?:php)?(.*)(?:\?>|$))/i
That's probably better so they can't place php at the end of the document (as PHP doesn't actually require a terminating character)
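The edited pattern could be wrapped in a small check like the one below. This is a sketch: `contains_php_tag()` is a hypothetical name, and the `s` modifier is an addition not present in the pattern above, made so that `.` also matches newlines in multi-line input.

```php
<?php
// Hypothetical helper built on the pattern above. The "s" modifier is an
// addition so that "." also matches newlines in multi-line input.
function contains_php_tag(string $input): bool
{
    return preg_match('/<\?(?:php)?.*(?:\?>|$)/is', $input) === 1;
}

var_dump(contains_php_tag("<?php\necho 1;"));
var_dump(contains_php_tag('<p>hello</p>'));
```

The check also fires on other `<?` constructs such as an XML declaration, which is one reason a parser is the more reliable choice.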
The SHJS syntax highlighter for JavaScript has files with regular expressions (http://shjs.sourceforge.net/lang/) for the languages it highlights. You can check how SHJS parses code.
HTMLPurifier is the recommended tool for cleaning up HTML. And as luck has it, it also includes CSSTidy and can sanitize CSS as well.
... that there is not PHP code or SQL Injection or whatever.
You are basing your question on a wrong premise. While HTML can be cleaned, this is no safeguard against other exploits. PHP "tags" are most likely to be filtered out. But if you are doing something else weird with the content (partially include-ing or eval-ing it), that's no real help.
And SQL exploits can only be prevented by meticulously using the proper database escape functions. There is no magic solution to that.
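As an alternative to calling escape functions by hand, a prepared statement binds the value separately from the SQL text. The sketch below assumes the pdo_sqlite extension is available; the in-memory database and the `snippets` table are made up for the example.

```php
<?php
// Assumption: the pdo_sqlite extension is available; the in-memory database
// and the "snippets" table exist only for this example.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE snippets (id INTEGER PRIMARY KEY, html TEXT)');

// Hostile-looking input is bound as a parameter and stored verbatim.
$input = "<p>x</p>'); DROP TABLE snippets; --";
$insert = $pdo->prepare('INSERT INTO snippets (html) VALUES (:html)');
$insert->execute([':html' => $input]);

// Read the value back unchanged.
$stored = $pdo->query('SELECT html FROM snippets')->fetchColumn();
```

The stored value is still raw user HTML, so it needs sanitising or escaping when it is output; the prepared statement only addresses the SQL side.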
Yes. htmlpurifier is a good tool to remove malicious scripts and validate your HTML. Don't think it does CSS though. Apparently it works with CSS too. Thanks Briedis.
OK, thank you all.
Actually, I realize that I need human validation. Users can post HTML + CSS, and I can verify in PHP that the language and the syntax are correct, but that doesn't prevent people from posting an iframe, an HTML redirect, or a big black div that covers the whole screen.
:-)

How to restrict or limit the html tags a user can enter in a web form, pref. client side?

What are good options to restrict the type of html tags a user is allowed to enter into a form field? I'd like to be able to do that client side (presumably using JavaScript), server-side in PHP if it's too heavy for the user's browser, and possibly a combo of both if appropriate.
Effectively I'd like users to be able to submit data with the same tag-set as on Stackoverflow, plus maybe the standard MathML tags. The form must accept UTF-8 text, including Asian ideograms, etc.
In the application, the user must be able to submit text-entries with basic html tags, and those entries must be able to be displayed to (potentially different) users with the html rendered correctly in a way that is safe to the users. I'm planning to use htmlspecialchars() and htmlspecialchars_decode() to protect my db server-side.
Many thanks,
JDelage
PS: I searched but couldn't find this question...
If you're looking to filter input against XSS attacks etc., consider using an existing library like HTML Purifier. I've not used it myself yet but it promises a lot and is in high regard.
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
I think it is much easier to use strip_tags and just specify the tags you are allowing.
You could do something like this, if you are familiar with regular expressions:
<?php
function parse($string)
{
//To stop unwanted HTML tags being used
$string = str_replace("<","&lt;",$string); //Replace all < with the HTML equiv
$string = str_replace(">","&gt;",$string); //Replace all > with the HTML equiv
$find = array(
"%\*\*\*(.+?)\*\*\*%s", //Search for ***any string here***
"%`(.+?)`%s", //Search for `any string here`
);
$replace = array(
"<b>\\1</b>", //Replace with <b>any string here</b>
"<span style=\"background-color: #DDDDDD\">\\1</span>" //Replace with <span style="background-color: #DDDDDD">any string here</span>
);
$string = preg_replace($find,$replace,$string); //Do the find and replace
return $string; //Return the output
}
echo parse("***Hello*** `There` <b>Friend</b>");
?>
Outputs:
Hello There <b>Friend</b>
I had similar problem for some time. There were some $%^&*) who liked to post some comments like <script>alert('Hello');</script> or something like that. I got tired and made a small function, which helped me, to allow, only <br> or <br /> tags for normal view of message.
I did it only in PHP, but I think it might help you.
function eliminateTags($msg) {
    // Turn newlines into <br /> tags
    $setBrakes = nl2br($msg);
    $decodeHTML = htmlspecialchars_decode($setBrakes);
    // Keep only line breaks. Since PHP 5.3.4, self-closing tags in the
    // allowed list are ignored, so '<br>' covers both <br> and <br />.
    $withoutTags = strip_tags($decodeHTML, '<br>');
    return $withoutTags;
}

PHP HTML Entities

I want to display data sent by the user on screen.
Since it can contain dangerous code, it is best to clean this data with HTML entities.
Is there a better way to do html entities, besides this:
$name = clean($name, 40);
$email = clean($email, 40);
$comment = clean($comment, 40);
and this:
$data = array("name", "email", "comment");
function confHtmlEnt($data)
{
return htmlentities($data, ENT_QUOTES, 'UTF-8');
}
$cleanPost = array_map('confHtmlEnt', $_POST);
if so, how, and how does my wannabe structure
for html entities look?
Thank you for not flaming the newb :-).
"Clean POST", the only problem is you might not know in what context will your data appear. I have a Chat server now that works via browser client and a desktop client and both need data in a different way. So make sure you save the data as "raw" as possible into the DB and then worry about filtering it on output.
Do not encode everything in $_POST/$_GET. HTML-escaping is an output-encoding issue, not an input-checking one.
Call htmlentities (or, usually better, htmlspecialchars) only at the point where you're taking some plain text and concatenating or echoing it into an HTML page. That applies whether the text you are using comes from a submitted parameter, or from the database, or somewhere else completely. Call mysql_real_escape_string only at the point you insert plain text into an SQL string literal.
It's tempting to shove all that escaping stuff in its own box at the top of the script and then forget about it. But text preparation really doesn't work like that, and if you pretend it does you'll find your database irreparably full of double-encoded crud, backslashes on your HTML page and security holes you didn't spot because you were taking data from a source other than the (encoded) parameters.
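A short sketch of the double-encoding problem described above. The sample string is made up; the behaviour shown is that of htmlspecialchars with its default double_encode setting.

```php
<?php
$raw = 'Tom & Jerry <3';

// Encoding on input...
$storedEncoded = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
// ...and again on output produces double-encoded text.
$doubleEncoded = htmlspecialchars($storedEncoded, ENT_QUOTES, 'UTF-8');

// Storing raw text and encoding once at output gives the intended result.
$encodedOnce = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $doubleEncoded, "\n"; // Tom &amp;amp; Jerry &amp;lt;3
echo $encodedOnce, "\n";   // Tom &amp; Jerry &lt;3
```

A browser would render the double-encoded version as the literal text `Tom &amp; Jerry &lt;3`, which is the "crud" the paragraph above warns about.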
You can make the burden of remembering to mysql_real_escape_string go away by using mysqli's parameterised queries or another higher-level data access layer. You can make the burden of typing htmlspecialchars every time less bothersome by defining a shorter-named function for it, eg.:
<?php
function h($s) {
echo(htmlspecialchars($s, ENT_QUOTES));
}
?>
<h1> Blah blah </h1>
<p>
Blah blah <?php h($title); ?> blah.
</p>
or using a different templating engine that encodes HTML by default.
If you wish to convert the five special HTML characters to their equivalent entities, use the following method:
function filter_HTML($mixed)
{
return is_array($mixed)
? array_map('filter_HTML',$mixed)
: htmlspecialchars($mixed,ENT_QUOTES);
}
That would work for both UTF-8 and single-byte encoded strings.
But if the string is UTF-8 encoded, make sure to filter out any invalid characters sequence, prior to using the filter_HTML() function:
function make_valid_UTF8($str)
{
return iconv('UTF-8','UTF-8//IGNORE',$str);
}
Also see: http://www.phpwact.org/php/i18n/charsets#character_sets_character_encoding_issues
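As a possible alternative to pre-filtering with iconv, htmlspecialchars can handle invalid sequences itself via the ENT_SUBSTITUTE flag (available since PHP 5.4). The sketch below uses a made-up invalid string to show the difference.

```php
<?php
// "\xC3" starts a two-byte UTF-8 sequence, but "(" is not a valid second byte.
$invalid = "caf\xC3(";

// Without ENT_SUBSTITUTE or ENT_IGNORE, invalid input yields an empty string.
$strict = htmlspecialchars($invalid, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, invalid sequences are replaced by U+FFFD instead.
$lenient = htmlspecialchars($invalid, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '', $lenient !== '');
```

Substituting a replacement character keeps the rest of the text visible, whereas silently dropping bytes (as //IGNORE does) hides the fact that the input was malformed.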
You need to clean every element before displaying it. I usually do it with a function and an array, like your second example.
If you use a framework with a template engine, there is quite likely a possibility to auto-encode strings. Apart from that, what's simpler than calling a function and getting the entity-"encoded" string back?
Check out the filter libraries in php, in particular filter_input_array.
filter_input_array(INPUT_POST, FILTER_SANITIZE_SPECIAL_CHARS);
