HTML Purifier - what to purify?

I am using HTML Purifier to protect my application from XSS attacks. Currently I am purifying content from WYSIWYG editors because that is the only place where users are allowed to use XHTML markup.
My question is, should I use HTML Purifier also on username and password in a login authentication system (or on input fields of sign up page such as email, name, address etc)? Is there a chance of XSS attack there?

You should purify anything that could ever be displayed on a page. In an XSS attack, the attacker inserts <script> tags or other malicious markup, which can run code in other users' browsers or link to other sites.
Passwords and emails should be fine. Passwords should never be shown and emails should have their own validator to make sure that they are in the proper format.
Finally, always remember to run content through htmlentities() when you output it.
Oh, and look at filter_var as well. It is a very nice way of filtering variables.
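As a minimal sketch of the filter_var suggestion, the snippet below validates the format of an e-mail address. The helper name is_valid_email is hypothetical. Validation only checks the format, so the value still needs encoding wherever it is displayed.

```php
<?php
// Hypothetical helper: true when $email has a syntactically valid format.
function is_valid_email(string $email): bool
{
    // filter_var() returns the filtered value on success and false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email('alice@example.com'));
var_dump(is_valid_email('<script>alert(1)</script>'));
```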

XSS risks exist wherever data entered by one user may be viewed by other users. Even if this data isn't currently viewable, don't assume that a need to display it won't arise.
As far as the username and password go, you should never display a password, or even store it in a form that can be displayed (i.e. store only a hash made with password_hash(); a fast, unsalted hash such as sha1() is not suitable for passwords). For usernames, restrict the legal characters to something like [A-Za-z0-9_]. Finally, as the other answer suggests, use your language's HTML entity encoding function for any entered data that may contain reserved or special HTML characters, which prevents this data from being interpreted as markup when displayed.
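A minimal sketch of the username and password advice, under a few assumptions: the helper name is_valid_username and the 3 to 32 character limit are made up for illustration, and password_hash() is used instead of sha1() because a fast, unsalted hash is a poor fit for passwords.

```php
<?php
// Hypothetical helper: accept only letters, digits and underscores.
// The 3..32 length limit is an assumption, not a requirement.
function is_valid_username(string $username): bool
{
    return preg_match('/\A[A-Za-z0-9_]{3,32}\z/', $username) === 1;
}

// Store only a salted hash of the password, never the password itself.
$hash = password_hash('correct horse battery staple', PASSWORD_DEFAULT);

// At login, compare the submitted password against the stored hash.
$loginOk = password_verify('correct horse battery staple', $hash);
```

Because the stored value is a hash, there is nothing meaningful to display, so the password never reaches an HTML context at all.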

No, I wouldn't use HTML Purifier on username and password during login authentication. In my applications I use alphanumeric usernames and an input validation filter, and display them with htmlspecialchars with ENT_QUOTES. This is very effective and a hell of a lot faster than HTML Purifier. I have yet to see an XSS attack using an alphanumeric string. And BTW, HTML Purifier is useless when filtering alphanumeric content anyway, so if you force the input string through an alphanumeric filter then there is no point in running it through HTML Purifier. When it comes to passwords, they should never be displayed to anybody in the first place, which eliminates the possibility of XSS. And if for some perverse reason you want to display the passwords, then you should design your application in such a way that only the owner of the password is able to see it; otherwise you are screwed big time and XSS is the least of your worries!

HTML Purifier takes HTML as input, and produces HTML as output. Its purpose is to allow the user to enter html with some tags, attributes, and values, while filtering out others. This uses a whitelist to prevent any data that can contain scripts. So this is useful for something like a WYSIWYG editor.
Usernames and passwords on the other hand are not HTML. They're plain text, so HTML purifier is not an option. Trying to use HTML Purifier here would either corrupt the data, or allow XSS attacks.
For example, it lets the following through unchanged, which can cause XSS issues when inserted as an attribute value in some elements:
" onclick="javascript:alert()" href="
Or if someone tried to use special symbols in their password, and entered:
<password
then their password would become blank, and make it much easier to guess.
Instead, you should encode the text. The encoding required depends on the context, but you can use htmlentities when outputting these values if you stick to rule #0 and rule #1 of the OWASP XSS Prevention Cheat Sheet.

Related

Using HTML Purifier on a site with only plain text input

I would appreciate an answer to settle a disagreement between me and some co-workers.
We have a typical PHP / LAMP web application.
The only input we want from users is plain text. We do not invite or want users to enter HTML at any point. Form elements are mostly basic input text tags. There might be a few textareas, checkboxes etc.
There is currently no sanitizing of output to pages. All dynamic content, some of which came from user input, is simply echoed to the page. We obviously need to make it safe.
My solution is to use htmlspecialchars on all output at the time it is echoed on the page.
My co-workers' solution is to add HTML Purifier to the database layer. They want to pass all user entered input through HTML Purifier before it is saved to the database. Apparently they've used it like this on other projects but I think that is a misunderstanding of what HTML Purifier is for.
My understanding is that it only makes sense to use HTML Purifier on a site which allows the user to enter HTML. It takes HTML and makes it safer and cleaner based on a whitelist and other rules.
Who's right and who's wrong?
There's also the whole "escape on input or output" issue but I guess that's a debate for another time and place.
Thanks

As a general rule, escaping should be done for context and for use-case.
If what you want to do is output plain text in an HTML context (and you do), then you need to use escaping functionality that will ensure that you will always output plain text in an HTML context. Given basic PHP, that would indeed be htmlspecialchars($yourString, ENT_QUOTES, 'yourEncoding');.
If what you want to do is output HTML in an HTML context (you don't), then you would want to sanitise the HTML when you output it to prevent it from doing damage - here you would $purifier->purify($yourString); on output.
If you want to store plain text user input in a database (again, you do) by executing SQL statements, then you should either use prepared statements to prevent SQL injection, or an escaping function specific to your DB, such as mysqli_real_escape_string($link, $yourString) (the older mysql_real_escape_string() was removed in PHP 7).
You should not:
escape for HTML when you are putting data into the database
sanitise as HTML when you are putting data into the database
sanitise as HTML when you are outputting data as plain text
Of those, all are outright harmful, albeit to different degrees. Note that the following assumes the database is your only or canonical storage medium for the data (it also assumes you have SQL injection taken care of in some other way - if you don't, that'll be your primary issue):
if you escape for HTML when you put the data into the database, you rely on the guarantee that you will always be outputting the data into an HTML context; suddenly if you want to just put it into a plaintext file for printing as-is, you need to decode the data before you output it.
if you sanitise as HTML when you put the data into the database, you are destroying information that your user put there. Is it a messaging system and your user wanted to tell someone else about <script> tags? Your user can't do that - you'll destroy that part of his message!
Sanitising as HTML when you're outputting data as plain text (without also escaping it) may have confusing, page-breaking results if you don't set your sanitising module to strip all HTML (which you shouldn't, since then you clearly don't want to be outputting HTML).
Did you sanitise for a <div> context, but are putting your data into an inline element? Your user might put a <div> into your inline element, forcing a layout break into your page layout (how annoying this is depends on your layout), or to influence user perception of metadata (for example to make phishing easier), e.g. like this:
Name: John Doe(Site admin)
Did you sanitise for a <span> context? The user could use other tags to influence user perception of metadata, e.g. like this:
Name: John Doe (this user is an administrator)
Worst-case scenario: Did you sanitise your HTML with a version of HTML Purifier that later turns out to have a bug that does allow a certain kind of malicious HTML to survive? Now you're outputting untrusted data and putting users that view this data on your web page at risk.
Sanitising as HTML and escaping for HTML (in that order!) does not have this problem, but it means the sanitising step is unnecessary, meaning this constellation will just cost you performance. (Presumably that's why your colleague wanted to do the sanitising when saving the data, not when displaying it - presumably your use-case (like most) will display the data more often than the data will be submitted, meaning you would avoid having to deal with the performance hit frequently.)
tl;dr
Sanitising as HTML when you're outputting as plain text is not a good idea.
Escape / sanitise for use-case and context.
In your situation, you want to escape plain text for an HTML context (= use htmlspecialchars()).
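The rule "escape for use-case and context" can be sketched as follows. The helper name e() and the sample text are assumptions for illustration; the point is that the stored value stays raw and is only escaped at the moment it enters an HTML context.

```php
<?php
// Hypothetical helper: escape plain text for an HTML body or quoted attribute.
function e(string $text, string $charset = 'UTF-8'): string
{
    return htmlspecialchars($text, ENT_QUOTES, $charset);
}

// The raw value is what gets stored; nothing is escaped or sanitised on input.
$stored = 'I wanted to tell you about <script> tags & "quotes"';

// HTML context: escape at the moment of output.
$html = '<p>' . e($stored) . '</p>';

// Plain-text context (for example a text file): use the raw value as-is.
$plain = $stored . "\n";
```

Since nothing was escaped on the way in, the plain-text output needs no decoding step and the user's message about <script> tags survives intact.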

Saving textarea in the database

I've been searching about this, but I can't find the most important part - what field to use.
I want to save a textarea without allowing any kind of javascript, html or php. What functions should I run the posted textarea through before saving it in the database? And what field type should I use for it in the database? It'll be a description, max 1000 chars.

There are a number of ways to go around in removing/handling code so that it can be saved in your database.
Regular Expressions
One way (but may be hard and unreliable) is to remove/ detect code using regular expressions.
For example, the following removes all script tags using php code (Taken from here):
$mystring = preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $mystring);
The strip_tags PHP function
You can also make use of the built-in strip_tags function, which strips HTML and PHP tags from a string. The manual provides several examples, one of which is shown below for your convenience:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
HTML Purifier
You can check out HTML Purifier, which is a common HTML filter PHP library intended to detect and remove dangerous code.
Simple code found on their Getting Started Section:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
In Practice (Safe Output)
If you are trying to avoid XSS attacks or Injection attacks, cleaning user data is the wrong way to go about it. Removing tags is not a 100 % guarantee for keeping your service safe from these attacks. Therefore, in practice, user data containing code is not usually filtered/ cleaned, but rather escaped during output. More specifically, the special characters within the string are escaped, where these characters are based on the syntax of the language. An example of this is making use of PHP's htmlspecialchars function in order to convert special characters to their respective HTML entities. A Code Snippet taken from manual is shown below:
<?php
$new = htmlspecialchars("<a href='test'>Test</a>", ENT_QUOTES);
echo $new; // &lt;a href=&#039;test&#039;&gt;Test&lt;/a&gt;
?>
For more information about escaping and a very good explanation related to your question, look at this page. It shows you other forms of output escaping. Also, for a question and answer related to escaping, click here.
Furthermore, one more short but VITAL point I want to throw at you is that ANY data received from a user CANNOT be trusted.
SQL Injection Attacks
Definition (From here)
A SQL injection attack consists of insertion or "injection" of a SQL
query via the input data from the client to the application. A
successful SQL injection exploit can read sensitive data from the
database, modify database data (Insert/Update/Delete), execute
administration operations on the database (such as shutdown the DBMS),
recover the content of a given file present on the DBMS file system
and in some cases issue commands to the operating system.
For SQL Injection attacks: Use prepared statements and parameterized queries when storing information to the database. (Question and Answer found here) A tutorial of prepared statements using PDO can be found here.
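A minimal sketch of a prepared statement with PDO. It assumes the pdo_sqlite extension is available and uses an in-memory database so that it is self-contained; the table and column names are made up for illustration.

```php
<?php
// Assumption: pdo_sqlite is installed; the in-memory database is for illustration only.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE descriptions (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');

// Input that would break a query built by string concatenation.
$body = "Robert'); DROP TABLE descriptions;--";

// The placeholder keeps the value separate from the SQL text.
$insert = $pdo->prepare('INSERT INTO descriptions (body) VALUES (:body)');
$insert->execute([':body' => $body]);
$id = (int) $pdo->lastInsertId();

// The value comes back exactly as it was entered.
$select = $pdo->prepare('SELECT body FROM descriptions WHERE id = :id');
$select->execute([':id' => $id]);
$fetched = $select->fetchColumn();
```

The text is stored unmodified, so it must still be escaped with htmlspecialchars() when it is later echoed into a page.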
Cross-site Scripting (XSS)
Definition (from here):
Cross-Site Scripting attacks are a type of injection problem, in which
malicious scripts are injected into the otherwise benign and trusted
web sites. Cross-site scripting (XSS) attacks occur when an attacker
uses a web application to send malicious code, generally in the form
of a browser side script, to a different end user.
For XSS attacks: you should consult this famous page, which describes rule by rule on what needs to be done.

TLDR:
It is conventional to use htmlspecialchars() to encode text on output, rather than filter the text on input. A text field is fine for this purpose.
What you need to defend against
You are trying to protect yourself from XSS. XSS happens when users can store HTML control characters on your site. Other users will see this HTML markup, so a malicious user can use your page to redirect people to other sites, steal cookies and so on.
You need to consider this for all of your inputs: this should include any varchar or text field that can be stored in your database; not just your textareas. I can add malicious content to an input field just as easily as I can add it to a textarea.
How do we defend against this?
Let's say that a user claims that their username is:
<script src="http://example.com/malicious.js"></script>
The simplest way to handle this is to save this into the database "as is". However, whenever you echo it on the site, you should filter it through the PHP htmlspecialchars() function:
echo 'Hi, my name is ' . htmlspecialchars($user->username) . '!';
htmlspecialchars turns the HTML control characters (<, >, &, ', and ") into their HTML entities (&lt;, &gt;, &amp;, &#039; or &apos;, and &quot;). Note that the single quote is only converted when the ENT_QUOTES flag is set, which is the default since PHP 8.1. These entities look like the original characters in a browser (i.e. to normal users), but they do not act like actual HTML markup.
The result is that instead of running malicious JavaScript, the page would literally show the user's name as <script src="http://example.com/malicious.js"></script>.
Why filter on output? Why not on input?
1 - OWASP recommends this way
2 - If you forget to protect an input field, and someone figures it out and adds malicious content, you now need to find the malicious content in the database and repair the faulty code on your site.
3 - If you forget to encode an output field, and someone manages to sneak in malicious input, then you only need to repair the faulty code on your site.
4 - It is possible for users to write usernames that would break the HTML fields used to edit the usernames. If you encode the content before you store it in the database, then you need to display it "as is" in the appropriate input fields (let's assume that an admin or the user can change their username later). But, let's suppose that a user found a way to inject malicious code into the database. What if they said that their username is: " style="display:none;" />. The input field that would let the administrator change this username now looks like:
<input type="text" name="username" value="" style="display:none;" />" />
                     malicious content -> ^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, the admins can't fix the problem: the input field has disappeared. But, if you encode the text on output, then all of your input fields will have protection against malicious content. Now, your inputs will look like this:
<input type="text" name="username" value="&quot; style=&quot;display:none;&quot; /&gt;" />
                          safe content -> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
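The edit-form case can be sketched in PHP as follows. The helper name render_text_input is hypothetical; it simply encodes the attribute values at the moment the field is rendered, while the database keeps the username exactly as entered.

```php
<?php
// Hypothetical helper: build a text input whose attributes are encoded on output.
function render_text_input(string $name, string $value): string
{
    return sprintf(
        '<input type="text" name="%s" value="%s" />',
        htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($value, ENT_QUOTES, 'UTF-8')
    );
}

// The username from the example, stored unmodified in the database.
$username = '" style="display:none;" />';

echo render_text_input('username', $username), "\n";
```

The browser decodes the entities back into the original characters inside the field, so the admin sees and can edit the real username.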

In my specific PHP app, what else can I do against XSS vulnerabilities?

I have read OWASP's XSS prevention cheat sheet but I don't really recognize my application with those rules. I don't feel like I have any of the vulnerabilities pointed out in those rules.
I am doing a PHP application that follows all the following principles:
Not a single user input is displayed directly on the HTML page without being processed and sanitized on the server-side
All my user input are sanitized with htmlentities(). Is that sufficient? (I use prepared statements for SQL injection)
Some of the user input have a maxlength condition of 5 characters on server-side. Does that help protect against XSS? (since I hardly see an XSS code being shorter than 6 characters)
Apart from data from the database, the only user input that is displayed back to the user was sent to the server via ajax, sanitized with htmlentities and reintroduced in the DOM using text() instead of html() (using jQuery)
Should I be concerned about XSS in my case? What else can I do to protect myself from XSS?

All my user input are sanitized with htmlentities(). Is that sufficient? (I use prepared statements for SQL injection)
No. First, you should filter on output, not on input. In programming, never trust any data, even data from your own database! On input, you just need to escape it for use in SQL, logs, etc. On output, you have to deal not only with basic HTML but also with some special characters: \0 & < > ( ) + - = " ' \. htmlentities() alone is not enough.
Imagine you have a image on site:
<img src="xxx" onload="image_loaded({some_text_from_db});">
{some_text_from_db} would be );alert(String.fromCharCode(58,53,53)
If you escape it just with htmlentities it will become:
<img src="xxx" onload="image_loaded();alert(String.fromCharCode(58,53,53));">
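One common way to handle a value that lands inside JavaScript inside an HTML attribute is to encode it twice: first as a JavaScript string literal, then for the HTML attribute. The sketch below is an assumption about how that could look, not the answer's own method; the helper name js_attr is hypothetical.

```php
<?php
// Hypothetical helper: encode a PHP string for use as a JavaScript argument
// inside an HTML event-handler attribute.
function js_attr(string $value): string
{
    // Step 1: turn the value into a JavaScript string literal.
    $flags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR;
    $literal = json_encode($value, $flags);

    // Step 2: escape that literal for the surrounding HTML attribute.
    return htmlspecialchars($literal, ENT_QUOTES, 'UTF-8');
}

$fromDb = ');alert(String.fromCharCode(58,53,53)';
$tag = '<img src="xxx" onload="image_loaded(' . js_attr($fromDb) . ');">';
```

After the browser decodes the attribute, image_loaded() receives the whole payload as a single quoted string argument instead of executing it.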
Some of the user input have a maxlength condition of 5 characters on server-side. Does that help protect against XSS? (since I hardly see an XSS code being shorter than 6 characters)
Always check data on the server side. Checking on the client side as well is fine, but never rely on it alone. Many modern browsers (Chrome, Firefox, Opera) allow the user to edit the page "on the fly", so they can easily remove the maxlength attribute.
Apart from data from the database, the only user input that is displayed back to the user was sent to the server via ajax, sanitized with htmlentities and reintroduced in the DOM using text() instead of html() (using jQuery)
From .text() jquery documentation:
We need to be aware that this method escapes the string provided as necessary so that it will render correctly in HTML. To do so, it calls the DOM method .createTextNode(), which replaces special characters with their HTML entity equivalents (such as &lt; for <).
So probably yes, it should be enough, but be aware of ways to escape from text(), as in the example above.
Your application filtering should look like this:
INPUT USER -> FILTER -> APPLICATION
OUTPUT APPLICATION -> FILTER -> USER
Not just input filtering.

I suggest using HTMLawed or HTMLPurifier for user input that needs to be displayed as HTML, or just completely stripping all HTML from user input that shouldn't contain it anyway. HTMLPurifier is the more powerful of the two, and I've never had any XSS issues in any projects with which I have used it.

Protection against XSS exploits?

I'm newish to PHP but I hear XSS exploits are bad. I know what they are, but how do I protect my sites?

To prevent XSS attacks, you have to properly check and validate all user-entered data that you plan on using, and not allow HTML or JavaScript code to be inserted through that form.
Alternatively, you can use htmlspecialchars() to convert HTML characters into HTML entities, so that characters like < and > that mark the beginning and end of a tag are turned into HTML entities. You can also use strip_tags() to allow only some tags, but be aware that the function does not strip out harmful attributes like onclick or onload.

Escape all user data (data in the database that came from users) with the htmlentities() function.
For HTML data (for example from WYSIWYG editors), use HTML Purifier to clean the data before saving it to the database.

Use strip_tags() if you want to have no tags at all, meaning anything like <somethinghere>.
htmlspecialchars() would convert them to HTML entities, so the browser will only show them and not try to run them.
If you want to allow good HTML, I would use something like htmLawed or HTML Purifier.

The bad news
Unfortunately, preventing XSS in PHP is a non-trivial undertaking.
Unlike SQL injection, which you can mitigate with prepared statements and carefully selected white-lists, there is no provably secure way to separate the information you are trying to pass to your HTML document from the rest of the document structure.
The good news
However, you can mitigate known attack vectors by being particularly cautious with your escaping (and keeping your software up-to-date).
The most important rule to keep in mind: Always escape on output, never on input. You can safely cache your escaped output if you're concerned about performance, but always store and operate on the unescaped data.
XSS Mitigation Strategies
In order of preference:
If you are using a templating engine (e.g. Twig, Smarty, Blade), check that it offers context-sensitive escaping. I know from experience that Twig does. {{ var|e('html_attr') }}
If you want to allow HTML, use HTML Purifier. Even if you think you only accept Markdown or ReStructuredText, you still want to purify the HTML these markup languages output.
Otherwise, use htmlentities($var, ENT_QUOTES | ENT_HTML5, $charset) and make sure the rest of your document uses the same character set as $charset. In most cases, 'UTF-8' is the desired character set.
Why shouldn't I filter on input?
Attempting to filter XSS on input is premature optimization, which can lead to unexpected vulnerabilities in other places.
For example, a recent WordPress XSS vulnerability employed MySQL column truncation to break their escaping strategy and allow the prematurely escaped payload to be stored unsafely. Don't repeat their mistake.

Users entering ampersand & character messing up my sites w3c validation

My social networking site is W3C XHTML valid. However, users are able to post blog reports and other content, and at times they enter ampersand characters, which in turn break my validation. How can I fix this, and are there any other single characters I need to look out for that could break my validation?

When displaying user produced content, run it through the htmlspecialchars() function.

As a matter of general principle it's a mistake to include user-submitted (or indeed any external) content into your page directly without validation or filtering. Besides causing validation errors it can also cause "broken pages" and large security holes (cross-site scripting attacks).
Whenever you get data from anywhere that isn't 100% trusted, you need to make it safe in some way. You can do this by doing some or all of:
Escaping textual data so that special characters are replaced by the HTML entities that represent them.
Stripping or filtering unsafe HTML tags.
Validating that HTML doesn't contain any unsafe or illegal constructs.
If your user input is meant to be interpreted as text then you're mostly looking at option 1; if you're letting the users use HTML then you're looking at options 2 and 3. A fourth option is to have the users use some more restrictive non-HTML markup such as Markdown or bbCode, translating between that markup and HTML using a library that (hopefully) doesn't allow the injection of security holes, page-breaking constructs, or other scary things.

It's a bad idea to allow users to enter HTML markup.
This enables all kinds of nasty things, most notably cross-site scripting (XSS) exploits and injection of hidden spam (hidden from you, not search engine bots).
You should:
Obliterate all HTML tags using htmlspecialchars() and only preserve newlines with nl2br(). You might allow some formatting by implementing your own safe markup that allows only very specific tags (things like phpBB or Wiki-like markup).
Use HTML Purifier to reliably eliminate all potentially dangerous markup. PHP's strip_tags() function is fundamentally broken for this purpose: it allows dangerous code in attributes if you use the whitelist argument.
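Both points can be sketched briefly; the sample strings are made up for illustration. The first part encodes everything and keeps only line breaks, the second shows how strip_tags() with a whitelist leaves event-handler attributes in place.

```php
<?php
// Plain-text approach: encode everything, then keep the user's line breaks.
$comment = "Line one\n<b>bold?</b>";
$safe = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));

// strip_tags() pitfall: whitelisted tags keep all of their attributes,
// and the text inside removed tags is left behind.
$dirty = '<a href="#" onclick="alert(1)">click</a><script>bad()</script>';
$stripped = strip_tags($dirty, '<a>');

echo $safe, "\n", $stripped, "\n";
```

Note the order in the first part: htmlspecialchars() runs before nl2br(), otherwise the inserted line-break tags would themselves be encoded.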
