Using HTML Purifier on a site with only plain text input

Using HTML Purifier on a site with only plain text input - php

I would appreciate an answer to settle a disagreement between me and some co-workers.
We have a typical PHP / LAMP web application.
The only input we want from users is plain text. We do not invite or want users to enter HTML at any point. Form elements are mostly basic input text tags. There might be a few textareas, checkboxes etc.
There is currently no sanitizing of output to pages. All dynamic content, some of which came from user input, is simply echoed to the page. We obviously need to make it safe.
My solution is to use htmlspecialchars on all output at the time it is echoed on the page.
My co-workers' solution is to add HTML Purifier to the database layer. They want to pass all user entered input through HTML Purifier before it is saved to the database. Apparently they've used it like this on other projects but I think that is a misunderstanding of what HTML Purifier is for.
My understanding is that it only makes sense to use HTML Purifier on a site which allows the user to enter HTML. It takes HTML and makes it safer and cleaner based on a whitelist and other rules.
Who's right and who's wrong?
There's also the whole "escape on input or output" issue but I guess that's a debate for another time and place.
Thanks

As a general rule, escaping should be done for context and for use-case.
If what you want to do is output plain text in an HTML context (and you do), then you need to use escaping functionality that will ensure that you will always output plain text in an HTML context. Given basic PHP, that would indeed be htmlspecialchars($yourString, ENT_QUOTES, 'yourEncoding');.
If what you want to do is output HTML in an HTML context (you don't), then you would want to santitise the HTML when you output it to prevent it from doing damage - here you would $purifier->purify($yourString); on output.
If you want to store plain text user input in a database (again, you do) by executing SQL statements, then you should either use prepared statements to prevent SQL injection, or an escaping function specific to your DB, such as mysql_real_escape_string($yourString).
You should not:
escape for HTML when you are putting data into the database
sanitise as HTML when you are putting data into the database
sanitise as HTML when you are outputting data as plain text
Of those, all are outright harmful, albeit to different degrees. Note that the following assumes the database is your only or canonical storage medium for the data (it also assumes you have SQL injection taken care of in some other way - if you don't, that'll be your primary issue):
if you escape for HTML when you put the data into the database, you rely on the guarantee that you will always be outputting the data into an HTML context; suddenly if you want to just put it into a plaintext file for printing as-is, you need to decode the data before you output it.
if you sanitise as HTML when you put the data into the database, you are destroying information that your user put there. Is it a messaging system and your user wanted to tell someone else about <script> tags? Your user can't do that - you'll destroy that part of his message!
Sanitising as HTML when you're outputting data as plain text (without also escaping it) may have confusing, page-breaking results if you don't set your sanitising module to strip all HTML (which you shouldn't, since then you clearly don't want to be outputting HTML).
Did you sanitise for a <div> context, but are putting your data into an inline element? Your user might put a <div> into your inline element, forcing a layout break into your page layout (how annoying this is depends on your layout), or to influence user perception of metadata (for example to make phishing easier), e.g. like this:
Name: John Doe(Site admin)
Did you sanitise for a <span> context? The user could use other tags to influence user perception of metadata, e.g. like this:
Name: John Doe (this user is an administrator)
Worst-case scenario: Did you sanitise your HTML with a version of HTML Purifier that later turns out to have a bug that does allow a certain kind of malicious HTML to survive? Now you're outputting untrusted data and putting users that view this data on your web page at risk.
Sanitising as HTML and escaping for HTML (in that order!) does not have this problem, but it means the sanitising step is unnecessary, meaning this constellation will just cost you performance. (Presumably that's why your colleague wanted to do the sanitising when saving the data, not when displaying it - presumably your use-case (like most) will display the data more often than the data will be submitted, meaning you would avoid having to deal with the performance hit frequently.)
tl;dr
Sanitising as HTML when you're outputting as plain text is not a good idea.
Escape / sanitise for use-case and context.
In your situation, you want to escape plain text for an HTML context (= use htmlspecialchars()).

Related

htmlentities - How do I keep my form safe in PHP?

I have a small contact form that requires a subject, a name, an email, and a message. All these fields are sanitized with:
function sanitize($string){
return htmlentities($string, ENT_QUOTES, 'UTF-8');
}
I send the email using the mail() function from PHP, and I use the subject field as the email title/subject. The problem is that in subject I have a single quote:
I'd like to explore
But after I sanitize it I get the email in the inbox, but that single quote is sanitized and appears like this:
I'd like to explore
How do I keep my form safe, but at the same time how do I get my subject of the email to be with the single quote not with that 'd-thing?

You store data using your sanitize function and if you want to display the data, you need to "desanitize" to the original format.
$str = 'I'd like to explore';
echo html_entity_decode($str, ENT_QUOTES, 'UTF-8');
To be strict, storing any data is not a security risk. However the process how the data is stored can cause security issues. Using HTML entities is usually done for storing data in databases and use it later on. But a better way to store that data is by using prepared statements which makes this completely unnecessary.
The question really is, for what storage you are sanitizing.

htmlentities is used when you want to print user-generated content as HTML on your own site. User-generated content could contain HTML-breaking symbols, so it replaces characters with HTML entities. (A wrong ' could terminate a string or if a troll inserts </div>, you entire layout of your website would break.)
That '-thing is a ' and is, from the perspective of any html-rendering-engine, an apostrophe and will be displayed as such. You can find a list of these entities online.
The reverse of that process is html_entity_decode, in order to get plain text out of it again.
In regards to your use case of keeping your form safe: htmlentities is the not the right tool for this.
If you want to send that email with plain text, keep it as plain text. It's true that you should not trust user-generated content and behavior, yet you then should safeguard against database insertion issues (prepared statements) and preventing POST-repetition on the form itself in order to not spam people.
Yet there is no inherent harm in treating user input as raw text and sending that content as raw text. (If you want to to send an HTML email with user-generated content, then htmlentities will come in handy for that as well.)

Sanitization is done within a context. In other words you should know what you are "sanitizing" against: SQL injection, html script attacks or what not. With this in mind, the correct approach is to deal with each problem separately: sanitize for the SQL injection when storing in the db, but store the data as close to the "original" as you can. This means the data in the db will be "desanitized". Then, depending on what you intend to do with the data, you process it accordingly: sending in an email? Just dump the content. Showing on the web page? html_entities it.
Note that the question of "optimization" in the sense of not having to process with a sanitizer every time you output content is different. It has to be dealt with separately, if at all needed.
This has proven itself a sound approach for me over years.

Security measures - when and how

Currently I am upgrading a web application in which I will get most of the input from logged in users. The input will contains valid html, images, audio, video & upload facilities to user defined path. The application then formats it into nice ui and displays to end users. These privileged users can add / modify / delete the content using a web based interface.
As per the basic rule of thumb: I should escape my data before entering in DB, and not to receive data receive from user. To achieve that I have planned to follow following security measures. Which also includes my questions
I am using prepared statements to store all user inputs to DB. I hope this eliminates the DB injection threat.
Is this measure enough? or do i need to check for % and _ symbols as well for mysql LIKE queries?
The user input (lets call input A), where I am not expecting any HTML/css, I use strip_tags & htmlentities before inserting in DB.
Is this adequate measure ? Should I be using more
The user input (lets call input B), in which user can have html/css tags, I user htmlentities on text then insert in DB.
As far as I am aware I should not use htmlentities before inserting in the DB, but have to as previous programmer was using it. Are there any negative impacts for this?
After fetching from DB and Before displaying the input A / input B , I am not doing any pre processing assuming, the data added to DB should be clean.
Should i process / sanitize the data before displaying ? If yes then how ?
I want to html tags enters by user to be parsed by browser and not displayed to user. e.g. if user had entered <p style='color:red;'>hello</p><p class='noclass'>world</p>, I want user to see 2 words only and not actual text.
To achieve this how can I make sure that user doesn't add malicious script and at the same time the html tags are stored, fetched and parsed by browser correctly.
Please guide if the current approach is sufficient / not sufficient / less / incorrect.
I am neither a 100% newbie to php nor I m pro. I know the basics about php (or we can say over all web applications') security. So can someone can please guide me if I am making any mistake security wise OR should not be doing something OR should be doing something more or less.
I know the basics of security but I still get confused over
Which exact security measure to apply at which exact point ? (e.g. escape string BEFORE inserting to DB)
At every point what the functions available in php? (e.g. to escape strings use prepared statements)

Yes, prepared statements are great at preventing SQL injections problems. Yes, you will have to take care of % and _ in LIKE queries, a prepared statement cannot escape them since it has no way to know whether you want those values there or not.
through 5.: It's always a bad idea to escape data going into the database for a format it's destined for on output. Why? First of all, why are you so sure you're always going to use the data in an HTML context? Maybe you'll be using it in a different format in the future, and then you'll have garbage looking data. (This is more hypothetical in your case, as you're explicitly storing HTML.)
Secondly though, your output code will have to rely on your input code to correctly have escaped data in advance, possibly with a long time between input and output. Your output code can have no confidence whatsoever that the input code did the correct job for what the output code needs it to do. Therefore, escaping for output must happen at the time of output. No sooner, no later.
Thirdly (is that a word?), strip_tags is absolutely insufficient to accept some HTML but not other "insecure" HTML. You need a more complex library which has more complex whitelisting rules than what strip_tags can do. Supposedly the only library that does that is HTML Purifier. I'd run all user HTML through it.
To summarise:
Prepared statements.
HTML-escape data that is not supposed to contain literal HTML on output.
Run any data that is supposed to contain literal HTML through HTML Purifier. Whether you do this before or after inserting to the database is up to you, depending on whether you want to store the literal input the user sent you or whether you don't mind discarding that original data immediately and storing only sanitised data instead. But, the same caveat about having confidence in your output code applies too.

validating untrusted HTML input do I have to process each input?

For Cross-site_scripting vulnerabilities
1)is it a good idea to validate and escape each and every one of the user inputs
2)is using strip_tags good enough and what's the benefit of htmlpurifier over it?

Yes this is a good idea. I would go as far as to say if you don't your are an idiot. When storing the data in a database use prepared statements and bound parameters. If you use that (like you should) you don't have to manually escape the data going into the database.
Now for displaying the data it depends what you want to allow and where you are going to output it. If it will be displayed on a HTML page and you don't want to allow any HTML to be rendered use htmlspecialchars($content, ENT_QUOTES). You almost never have to use htmlentities because that will convert ALL characters for which there is an HTML entity. Meaning it will make your document unnecessary bigger. If you want to allow some HTML you would have to filter it before displaying it (using HTML purifier).
Please note that different storage mechanisms and different output media require a different escaping / sanitizing strategy.

working with user input data in php. What's better?

I am trying to figure out what is the best way to manage the data a user inputs concerning non desirable tags he might insert:
strip_tags() - the tags are removed and they are not inserted in the database
the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
What's the better, and is there any disadvantage in any of these?
Regards

This depends on what your priority is:
if it's important to display special characters from user input (like on StackOverflow, for example), then you'll need to store this information in the database and sanitize it on display - in this case, you'll want to at least use htmlspecialchars() to display the output (if not something more sophisticated)
if you just want plain text comments, use strip_tags() before you stick it in the database - this way you'll reduce the amount of data that you need to store, and reduce processing time when displaying the data on the screen

the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
This. You usually want people to be able to type less-than signs and ampersands and have them displayed as such on the page. htmlspecialchars on every text-to-HTML output step (whether that text came directly from user input, or from the database, or from somewhere else entirely) is the right way to achieve this. Messing about with the input is a not-at-all-appropriate tactic for dealing with an output-encoding issue.
Of course, you will need a different escape — or parameterisation — for putting text in an SQL string.

The measures taken to secure user input depends entirely on in what context the data is being used. For instance:
If you're inserting it into a SQL database, you should use parameterized statements. PHP's mysql_real_escape_string() works decently, as well.
If you're going to display it on an HTML page, then you need to strip or escape HTML tags.
In general, any time you're mixing user input with another form of mark-up or another language, that language's elements need to be escaped or stripped from the input before put into that context.
The last point above segues into the next point: Many feel that the original input should always be maintained. This makes a lot of sense when, later, you decide to use the data in a different way and, for instance, HTML tags aren't a big deal in the new context. Also, if your site is in some way compromised, you have a record of the exact input given.
Specifically related to HTML tags in user input intended for display on an HTML page: If there is any conceivable reason for a user to input HTML tags, then simply escape them. If not, strip them before display.

How do I HTML Encode all the output in a web application?

I want to prevent XSS attacks in my web application. I found that HTML Encoding the output can really prevent XSS attacks. Now the problem is that how do I HTML encode every single output in my application? I there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.

One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.

You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars

For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.

A nice way I used to escape all user input is by writing a modifier for smarty wich escapes all variables passed to the template; except for the ones that have |unescape attached to it. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)

My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason so encode anything is that even properties which you might assume to be boolean or numeric could contain malicious code (For example, checkbox values, if they're done improperly could be coming back as strings. If you're not encoding them before sending the output to the user, then you've got a vulnerability).

You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)

If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problem accepting external password containing < etc..

The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.

OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).

Output encoding is by far the best defense. Validating input is great for many reasons, but not 100% defense. If a database becomes infected with XSS via attack (i.e. ASPROX), mistake, or maliciousness input validation does nothing. Output encoding will still work.

there was a good essay from Joel on software (making wrong code look wrong I think, I'm on my phone otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everything. But by using they prefixes that infer a useful meaning looking at your code you'll see real quick if something isn't right. And you're going to need different encode/decode functions anyways.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.