Confused with html encoding

Confused with html encoding - php

I am getting confused with character encoding.
I understand people do things differently, but many suggest you should store your input in the database as it is entered, then deal with it when you are reading it in accordance with what you are planning to do with it. This makes sense to me.
So, if a user enters an apostrophe, double quote or ampersand, less than, greater than sign, these will be written in my database as ' " & < > respectively.
Now, reading the data with php, I am running the text through HTMLPurify to catch any injection issues.
Should I also htmlencode it? If I don't, it all appears OK (in Chrome and Firefox) but I am not sure if this is correct and will it display properly in other browsers?
If I use htmlentities with ENT_QUOTES, and htmlspecialchars, I start getting the codes coming through for these characters, which I believe is what I should see if looking at the page source, but not on the page the user sees.
The problem is, without doing the encoding, I am seeing what I want to see, but have this niggle in my mind, that I am not doing it correctly!

You have this confused. Character encoding is an attribute of YOUR systems. Your websites and your database are responsible for character encoding.
You have to decide what you will accept. I would say in general, the web has moved towards standardization on UTF-8. So if your websites that accept user input AND your database, and all connections involved are UTF-8, then you are in a position to accept input as UTF-8, and your character set and collation in the database should be configured appropriately.
At this point all your web pages should be HTML5, so the recommended HEAD section of your pages should at a minimum be this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
Next you have SQL injection. You specified PHP. If you are using mysqli or PDO (which is in my experience the better choice) AND you are using bindParameter for all your variables, there is NO ISSUE with SQL injection. That issue goes away, and the need for escaping input goes away, because you no longer have to be concerned that a SQL statement could get confused. It's not possible anymore.
Finally, you mentioned htmlpurifier. That exists so that people can try and avoid XSS and other exploits of that nature, that occur when you accept user input, and those people inject html & js.
That is always going to be a concern, depending on the nature of the system and what you do with that output, but as others suggested in comments, you can run sanitizers and filters on the output after you've retrieved it from the database. Sitting inside a php string variable there is no intrinsic danger, until you weaponize it by injecting it into a live html page you are serving.
In terms of finding bad actors and people trying to mess with your system, you are obviously much better off having stored the original input as submitted. Then as you come to understand the nature of these exploits, you can search through your database looking for specific things, which you won't be able to do if you sanitize first and store the result.

Related

Zend\escap data before inserting into database

I'm adding some xss protection to the website I'm working on, the platform is zendFrameWork 2 and therefor I'm using Zend\escaper. from zend documentation i knew that:
Zend\Escaper is meant to be used only for escaping data that is to be
output, and as such should not be misused for filtering input data.
For such tasks, the Zend\Filter component, HTMLPurifier.
but what are the riskes if i escaped the data before inserting it into the database, am i so wrong to do that? please explane to me as im somehow new to this topic.
thanks

When encoding data before storing it you will have to decode it before you can do anything sensible with it before outputting it. That's why I'd not do it.
Let's say you have an international application and you want to store the escaped value of a form field which might contain any NON-ASCII characters those might become escaped into HTML-Entities. So what if you have to quantify the content of that field? Like counting the characters? You will always have to de-escape the content before counting it. and then you have to re-escape it again. Much work done but nothing gained.
The same applies to search-operations in your database. You will have to escape the search-phrase the same way then your input for the database to understand what you are looking for.
I'd use one character-set throughout the application and database (I prefer UTF-8, beware of the MySQL-Connection....) and only escape content on output. Thant way I can then do whatever I like with the data and are on the safe side on output. And escaping is done in my view-layer automaticaly so I don't even have to think about it every time I handle data as it works automaticaly. That way you can't forget it.
That does not prevent me from filtering and sanitizing the input. And it doesn't prevent me from escaping the database-content using the appropriate database-escaping mechanisms like mysqli_real_escape_string or similar or using prepared statements!
But that's just my opinion, others might think otherwise!

"Output" here refers to the web page. A form field ( HTML tag) is an INPUT (from the webpage), any text is an OUTPUT (to the webpage). You need to ensure any output (to the webpage) does not contain dangerous characters that could be used to forge XSS attack vectors.
This said, if you have DANGEROUS_INPUT_X given by the user and then
$NOT_DANGEROUS_ANYMORE = ZED.HtmlPurifier(DANGEROUS_INPUT_X)
DBSave($NOT_DANGEROUS_ANYMORE)
and somewhere else
$OUTPUT = DBLoad($NOT_DANGEROUS_ANYMORE)
echo $OUTPUT
you should be fine, as long as you do not apply any additional encoding/decoding to this output. It will be displayed in the way it is saved, that was safe.
I would suggest to look at output encoding more than validation: HtmlPurifier cleans the HTML, while you could accept any kind of bad characters if you ensure your output is encoded in the page.
Here https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet some general rules, here the PHP example
echo htmlspecialchars($DANGEROUS_INPUT_X_NOW_OUTPUT, ENT_QUOTES, "UTF-8");
Remember to set the Character Set and be consistent with the same one throughout your pages/scripts/binaries and in the database as well.

How to prevent XSS attack with Zend Form using %

our company has made a website for our client. The client hired a webs security company to test the pages for security before the product launches.
We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements.
They ran another test and it still failed :(
They used the following for the one input field in the data of the http header: name=%3Cscript%3Ealert%28123%29%3C%2Fscript%3E which basically translates to name=<script>alert(123);</script>
I've added alpha and alnum to some of the fields, which fixes the XSS vulnerability (touch wood) by removing the %, however, now the boss don't like it because what of O'Brien and double-barrel surnames...
I haven't come across the %3C as < problem reading up about XSS. Is there something wrong with my html character set or encoding or something?
I probably now have to write a custom filter, but that would be a huge pain to do that with every website and deployment. Please help, this is really frustrating.
EDIT:
if it's about escaping the form's output, how do I do that? The form submits to the same page - how do I escape if I only have in my view <?= $this->form ?>
How can I get Zend Form to escape it's output?

%3Cscript%3Ealert%28123%29%3C%2Fscript%3E is the URL-encoded form of <script>alert(123);</script>. Any time you include < in a form value, it will be submitted to the server as %3C. PHP will read and decode that back to < before anything in your application gets a look at it.
That is to say, there is no special encoding that you have to handle; you won't actually see %3C in your input, you see <. If you're failing to encode that for on-page display then you don't have even the most basic defenses against XSS.
We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements.
I'm afraid you have not fixed your XSS problems at all. You may have merely obfuscated them.
Input filtering is a depressingly common but quite wrong strategy for blocking XSS.
It is not the input that's the problem. As your boss says, there is no reason you shouldn't be able to input O'Brien. Or even <script>, like I am just now in this comment box. You should not attempt to strip tags in the input or even HTML-encode them, because who knows at input-time that the data is going to end up in an HTML page? You don't want your database filled with nonsense like 'Fish&Chips' which then ends up in an e-mail or other non-HTML context with weird HTML escapes in it.
HTML-encoding is an output-stage issue. Leave the incoming strings alone, keep them as raw strings in the database (of course, if you are hacking together queries in strings to put the data in the database instead of parameterised queries, you would need to SQL-escape the content at exactly that point). Then only when you are inserting the values in HTML, encode them:
Name: <?php echo htmlspecialchars($row['name']); ?>
If you have a load of dodgy code like echo "Name: $name"; then I'm afraid you have much rewriting to do to make it secure.
Hint: consider defining a function with a short name like h so you don't have to type htmlspecialchars so much. Don't use htmlentities which will usually-unnecessarily encode non-ASCII characters, which will also mess them up unless you supply a correct $charset argument.
(Or, if you are using Zend_View, $this->escape().)
Input validation is useful on an application-specific level, for things like ensuring telephone number fields contain numbers and not letters. It is not something you can apply globally to avoid having to think about the issues that arise when you put a string inside the context of another string—whether that's inside HTML, SQL, JavaScript string literals or one of the many other contexts that require escaping.

If you correctly escape strings every time you write them to the HTML page, you won't have any issues.
%3C is a URL-encoded <; it is decoded by the server.

Using Wordpress, can some one tell me the best way of sanitizing input?

I'm developing an application using Wordpress as a CMS.
I have a form with a lot of input fields which needs to be sanitized before stored in the database.
I want to prevent SQL injection, having javascript and PHP code injected and other harmful code.
Currently I'm using my own methods to sanitize data, but I feel that it might be better to use the functions which WP uses.
I have looked at Data Validation in Wordpress, but I'm unsure on how much of these functions I should use, and in what order. Can anyone tell what WP functions are best to use?
Currently I'm "sanitizing" my input by doing the following:
Because characters with accents (é, ô, æ, ø, å) got stored in a funny way in the Database (even though my tables are set to ENGINE=InnoDB, DEFAULT CHARSET=utf8 and COLLATE=utf8_danish_ci), I'm now converting input fields that can have accents, using htmlentities().
When creating the SQL string to input the data, I use mysql_real_escape_string().
I don't think this is enough to prevent attacks though. So suggestions to improvement is greatly appreciated.

Input “sanitisation” is bogus.
You shouldn't attempt to protect yourself from injection woes by filtering(*) or escaping input, you should work with raw strings until the time you put them into another context. At that point you need the correct escaping function for that context, which is mysql_real_escape_string for MySQL queries and htmlspecialchars for HTML output.
(WordPress adds its own escaping functions like esc_html, which are in principle no different.)
(*: well, except for application-specific requirements, like checking an e-mail address is really an e-mail address, ensuring a password is reasonable, and so on. There's also a reasonable argument for filtering out control characters at the input stage, though this is rarely actually done.)
I'm now converting input fields that can have accents, using htmlentities().
I strongly advise not doing that. Your database should contain raw text; you make it much harder to do database operations on the columns if you've encoded it as HTML. You're escaping characters such as < and " at the same time as non-ASCII characters too. When you get data from the database and use it for some other reason than copying it into the page, you've now got spurious HTML-escapes in the data. Don't HTML-escape until the final moment you're writing text to the page.
If you are having trouble getting non-ASCII characters into the database, that's a different problem which you should solve first instead of going for unsustainable workarounds like storing HTML-encoded data. There are a number of posts here all about getting PHP and databases to talk proper UTF-8, but the main thing is to make sure your HTML output pages themselves are correctly served as UTF-8 using the Content-Type header/meta. Then check your MySQL connection is set to UTF-8, eg using mysql_set_charset().
When creating the SQL string to input the data, I use mysql_real_escape_string().
Yes, that's correct. As long as you do this you are not vulnerable to SQL injection. You might be vulnerabile to HTML-injection (causing XSS) if you are HTML-escaping at the database end instead of the template output end. Because any string that hasn't gone through the database (eg. fetched directly from $_GET) won't have been HTML-escaped.

When to filter/sanitize data: before database insertion or before display?

As I prepare to tackle the issue of input data filtering and sanitization, I'm curious whether there's a best (or most used) practice? Is it better to filter/sanitize the data (of HTML, JavaScript, etc.) before inserting the data into the database, or should it be done when the data is being prepared for display in HTML?
A few notes:
I'm doing this in PHP, but I suspect the answer to this is language agnostic. But if you have any recommendations specific to PHP, please share!
This is not an issue of escaping the data for database insertion. I already have PDO handling that quite well.
Thanks!

When it comes to displaying user submitted data, the generally accepted mantra is to "Filter input, escape output."
I would recommend against escaping things like html entities, etc, before going into the database, because you never know when HTML will not be your display medium. Also, different types of situations require different types of output escaping. For example, embedding a string in Javascript requires different escaping than in HTML. Doing this before may lull yourself into a false sense of security.
So, the basic rule of thumb is, sanitize before use and specifically for that use; not pre-emptively.
(Please note, I am not talking about escaping output for SQL, just for display. Please still do escape data bound for an SQL string).

i like to have/store the data in original form.
i only escape/filter the data depending on the location where i'm using it.
on a webpage - encode all html
on sql - kill quotes
on url - urlencoding
on printers - encode escape commands
on what ever - encode it for that job

There are at least two types of filtering/sanitization you should care about :
SQL
HTML
Obviously, the first one has to be taken care of before/when inserting the data to the database, to prevent SQL Injections.
But you already know that, as you said, so I won't talk about it more.
The second one, on the other hand, is a more interesting question :
if your users must be able to edit their data, it is interesting to return it to them the same way they entered it at first ; which means you have to store a "non-html-specialchars-escaped" version.
if you want to have some HTML displayed, you'll maybe use something like HTMLPurifier : very powerful... But might require a bit too much resources if you are running it on every data when it has to be displayed...
So :
If you want to display some HTML, using a heavy tool to validate/filter it, I'd say you need to store an already filtered/whatever version into the database, to not destroy the server, re-creating it each time the data is displayed
but you also need to store the "original" version (see what I said before)
In that case, I'd probably store both versions into database, even if it takes more place... Or at least use some good caching mecanism, to not-recreate the clean version over and over again.
If you don't want to display any HTML, you will use htmlspecialchars or an equivalent, which is probably not that much of a CPU-eater... So it probably doesn't matter much
you still need to store the "original" version
but escaping when you are outputing the data might be OK.
BTW, the first solution is also nice if users are using something like bbcode/markdown/wiki when inputting the data, and you are rendering it in HTML...
At least, as long as it's displayed more often than it's updated -- and especially if you don't use any cache to store the clean HTML version.

Sanitize it for the database before putting it in the database, if necessary (i.e. if you're not using a database interactivity layer that handles that for you). Sanitize it for display before display.
Storing things in a presently unnecessary quoted form just causes too many problems.

I always say escape things immediately before passing them to the place they need to be escaped. Your database doesn't care about HTML, so escaping HTML before storing in the database is unnecessary. If you ever want to output as something other than HTML, or change which tags are allowed/disallowed, you might have a bit of work ahead of you. Also, it's easier to remember to do the escaping right when it needs to be done, than at some much earlier stage in the process.
It's also worth noting that HTML-escaped strings can be much longer than the original input. If I put a Japanese username in a registration form, the original string might only be 4 Unicode characters, but HTML escaping may convert it to a long string of "〹𐤲䡈穩". Then my 4-character username is too long for your database field, and gets stored as two Japanese characters plus half an escape code, which also probably prevents me from logging in.
Beware that browsers tend to escape some things like non-English text in submitted forms themselves, and there will always be that smartass who uses a Japanese username everywhere. So you may want to actually unescape HTML before storing.

Mostly it depends on what you are planning to do with the input, as well as your development environment.
In most cases you want original input. This way you get the power to tweak your output to your heart's content without fear of losing the original. This also allows you to troubleshoot issues such as broken output. You can always see how your filters are buggy or customer's input is erroneous.
On the other hand some short semantic data could be filtered immediately. 1) You don't want messy phone numbers in database, so for such things it could be good to sanitize. 2) You don't want some other programmer to accidentally output data without escaping, and you work in multiprogrammer environment. However, for most cases raw data is better IMO.

How do I HTML Encode all the output in a web application?

I want to prevent XSS attacks in my web application. I found that HTML Encoding the output can really prevent XSS attacks. Now the problem is that how do I HTML encode every single output in my application? I there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.

One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.

You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars

For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.

A nice way I used to escape all user input is by writing a modifier for smarty wich escapes all variables passed to the template; except for the ones that have |unescape attached to it. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)

My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason so encode anything is that even properties which you might assume to be boolean or numeric could contain malicious code (For example, checkbox values, if they're done improperly could be coming back as strings. If you're not encoding them before sending the output to the user, then you've got a vulnerability).

You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)

If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problem accepting external password containing < etc..

The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.

OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).

Output encoding is by far the best defense. Validating input is great for many reasons, but not 100% defense. If a database becomes infected with XSS via attack (i.e. ASPROX), mistake, or maliciousness input validation does nothing. Output encoding will still work.

there was a good essay from Joel on software (making wrong code look wrong I think, I'm on my phone otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everything. But by using they prefixes that infer a useful meaning looking at your code you'll see real quick if something isn't right. And you're going to need different encode/decode functions anyways.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.