HTML Purifier - Character Encoding


I plan on using HTML Purifier for the outputs of my webservice. I did not see an integrated "logging" functionality to check what is replaced, so I wrote it myself.
However, the purify() function automatically transforms my special character "entities".
For example:
& -> &amp;
&oslash; -> ø
The problem now is that these will also be "logged", as my logging function compares the differences between the "purified" string and the original one. Is there a way to avoid this automatic encoding/decoding, or does anyone have a better idea of how to check what is actually replaced?
Thank you!

The two examples you cite are actually two different use-cases: the one is because HTML Purifier is making your output safe (& -> &amp;), the other is HTML Purifier using UTF-8 instead of entities because that's its internal representation.
Generally speaking, if your HTML is safe, HTML Purifier will output semantically equivalent HTML. It's not guaranteed to keep e.g. all whitespace or the exact representation, because its focus is entirely on security, not idempotence for safe HTML, and it transforms incoming HTML quite heavily in the interest of thorough analysis.
You could force it to always turn all non-ASCII characters into entities with Core.EscapeNonASCIICharacters, but I doubt that's what you want: it will also change any UTF-8 that's not currently an entity into an entity. It also doesn't solve the problem that unescaped HTML special characters will be escaped (& -> &amp;). HTML Purifier doesn't take chances, so even those HTML special characters that are coincidentally/contextually safe will always be encoded.
Instead, take a look at Core.CollectErrors. That should enable checking for the changes that you're looking for. Despite the warning in the docs, it is a solid feature. You can see an example usage of that feature here. The tl;dr is that to get the error collector, you use $purifier->context->get('ErrorCollector');, and to get your list of errors (which includes replacements), $errorCollector->getRaw(). Try that and see if it works?
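If you would rather keep the diff-based logging from the question, one option is to decode entities on both sides before comparing, so that pure representation changes (& vs &amp;, an entity vs its UTF-8 character) are not reported. This is only a minimal sketch and not part of HTML Purifier: the helper name purifierChangedContent is made up, and the sample strings are stand-ins for what $purifier->purify() would return.

```php
<?php
// Hypothetical helper: true only when the purified string differs from the
// original in more than entity-vs-character representation.
function purifierChangedContent(string $original, string $purified): bool
{
    $flags = ENT_QUOTES | ENT_HTML5;
    // Decode entities on both sides so "&oslash;" and "ø", or "&" and "&amp;", compare equal.
    $a = html_entity_decode($original, $flags, 'UTF-8');
    $b = html_entity_decode($purified, $flags, 'UTF-8');
    return $a !== $b;
}

// Stand-in strings; in real use the second argument would be $purifier->purify($original).
var_dump(purifierChangedContent('Sm&oslash;rrebr&oslash;d & beer', "Sm\u{F8}rrebr\u{F8}d &amp; beer")); // bool(false)
var_dump(purifierChangedContent('<script>x</script>hi', 'hi'));                                          // bool(true)
```

Whitespace or attribute reordering done by the purifier would still show up as a difference, so Core.CollectErrors remains the more precise route.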

Related

Is it safe to unescape ampersand for user input?

After a few hours of bug searching, I found out the cause of one of my most annoying bugs.
When users are typing out a message on my site, they can title it with plaintext and html entities.
This means in some instances, users will type a title with common html entity pictures like this face. ( ͡° ͜ʖ ͡°).
To prevent html injection, I use htmlspecialchars(); on the title, and annoyingly it would convert the picture to its html entity format when outputted onto the page later on.
( &#865;&#176; &#860;&#662; &#865;&#176;)
I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars, as well as doing what I wanted and encoding possible html injection, was turning the ampersand in the entities into &amp;.
By un-escaping all the ampersands, and changing them back to & this fixed my problem and the face would come out as expected.
However, I am unsure if this is still safe from malicious html. Is it safe to decode the ampersands in user-inputted titles? If not, how can I go about fixing this issue?

If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.
If you are not calling htmlspecialchars() twice explicitly, then it's probably a browser-side auto-escaping that may occur if the page containing the form is using an obsolete single-byte encoding like Windows-1252. Such automatic escaping is the only way to correctly represent characters not present in character set of the specific single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.
Make sure you are using Unicode (UTF-8 in particular) encoding.
To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form. And don't forget to save the HTML page itself in UTF-8 encoding. To use Unicode in PHP, it's typically enough to use multibyte (mb_ prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.
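To illustrate why the mb_ prefixed functions matter, here is a small sketch comparing byte-oriented and character-oriented string functions. It assumes the mbstring extension is loaded; the sample word is just an example.

```php
<?php
$title = "Sm\u{F8}rrebr\u{F8}d"; // "Smørrebrød" as UTF-8, written with escapes so the file encoding does not matter

var_dump(strlen($title));                 // int(12): counts bytes, ø takes two bytes
var_dump(mb_strlen($title, 'UTF-8'));     // int(10): counts characters
var_dump(mb_substr($title, 0, 3, 'UTF-8') === "Sm\u{F8}"); // bool(true): cuts on character boundaries
```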
As a temporary workaround, you can disable reencoding existing entities by setting 4th parameter ($double_encode) of the htmlspecialchars() function to false.
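As a rough illustration of that workaround, the following compares the default behaviour with $double_encode set to false. The input string is a made-up example of a title that already contains entities.

```php
<?php
$title = '&#9829; <b>hi</b> &amp; bye'; // hypothetical input that already contains entities

// Default: every "&" is escaped again, so existing entities show up as text.
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), "\n";
// &amp;#9829; &lt;b&gt;hi&lt;/b&gt; &amp;amp; bye

// $double_encode = false: existing entities are left alone, tags are still escaped.
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false), "\n";
// &#9829; &lt;b&gt;hi&lt;/b&gt; &amp; bye
```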

There is no straight answer. You may unescape &lt;script...&gt; into <script...> and end up in trouble. However, it looks like the text has been double encoded, probably once on input and then again when you output to screen. If you can guarantee it has been double encoded, then it should be safe to undo one of those.
However, the best solution is to keep the "raw" value in memory, and sanitize/encode for outputting into databases, html, JSON etc.
So - when you get input, sanitise it for anything you don't want, but don't actually convert it into HTML or escape it or anything else at this stage. Escape it into a database, html encode it when output to screen / xml etc.
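A minimal sketch of that idea: keep one raw value and encode it separately for each destination at the moment of output. The sample value and the array key are made up; for the database case you would typically bind the raw value as a parameter of a prepared statement rather than escaping it by hand.

```php
<?php
$raw = 'Tom & "Jerry" <3'; // hypothetical raw input, kept unescaped in memory

// Encode only at the point of output, using the rules of that output format.
$forHtml = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$forJson = json_encode(['title' => $raw]);

echo $forHtml, "\n"; // Tom &amp; &quot;Jerry&quot; &lt;3
echo $forJson, "\n"; // {"title":"Tom & \"Jerry\" <3"}
```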

Allow only hexadecimal html entities

Got a forum and posting HTML is forbidden.
However, some users would like to have the possibility to post some symbolic signs, hexadecimal html entities, such as:
&#x1f497;
See: http://graphemica.com/%F0%9F%92%97 for more info.
My questions are:
Is this safe to allow them such symbols at all (XSS, etc..)?
What's the best function to use, to allow it? Actually the symbolic html entities appear as plain text.
I want to disallow members using &amp; or &raquo; and so on, so just html-entities starting with &# and followed by a number plus the semicolon at the end.
Any idea how to solve this?

Another answer is to use jQuery's .text method to add the message to your forum message element.
Although you will have to change how your forum creates the message structure.
You can safely add any sequence of characters and none of them will be interpreted by the browser as HTML.
Example:
$('#message_text').text(naughty_msg_string);

Is this safe to allow them such symbols at all (XSS, etc..)?
No, this is never safe. For example, &amp; is just a convenient alias for &#38;, which is still an ampersand. Similarly, &#60; is a less-than sign, so naively allowing numeric HTML entities can still open up an XSS attack surface if you forget this during processing.
You could consider only allowing numeric entities outside the main ASCII table (128+), which would be safer.
What's the best function to use, to allow it? Actually the symbolic html entities appear as plain text.
Considering the above, preg_replace_callback is a good candidate, as it allows you to test the content before (dis)allowing it.
This also answers the third question, as you can just test for numbers in the regexp.
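One way this could look, as a sketch rather than a vetted filter: escape everything first, then use preg_replace_callback to restore only numeric entities whose code point is 128 or higher. The function name escapeAllowingNumericEntities is made up for this example.

```php
<?php
// Hypothetical helper: escape everything, then restore numeric entities >= 128.
function escapeAllowingNumericEntities(string $text): string
{
    // After this call no "<", ">", quote or bare "&" is left; "&#x1f497;" has become "&amp;#x1f497;".
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return preg_replace_callback(
        '/&amp;#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));/',
        function (array $m): string {
            // Group 1 holds a hexadecimal number, group 2 a decimal one.
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
            if ($codePoint < 128 || $codePoint > 0x10FFFF || $isSurrogate) {
                return $m[0]; // leave it escaped, so it is shown as plain text
            }
            return '&#' . $codePoint . ';'; // re-emit in decimal form
        },
        $escaped
    );
}

echo escapeAllowingNumericEntities('&#x1f497; &#60;script&#62; &raquo; <b>'), "\n";
// &#128151; &amp;#60;script&amp;#62; &amp;raquo; &lt;b&gt;
```

Named entities such as &raquo; and ASCII-range numeric entities such as &#60; stay escaped and are displayed literally.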

Do I need to use HTML entities when storing data in the database?

I need to store special characters and symbols into mysql database. So either I can store it as it is like 'ü' or convert it to html code such as '&uuml;'
I am not sure which would be better.
Also I am having symbols like '♥', '„' .
Please suggest which one is better? Also suggest if there is any alternative method.
Thanks.

HTML entities have been introduced years ago to transport character information over the wire when transportation was not binary safe and for the case that the user-agent (browser) did not support the charset encoding of the transport-layer or server.
As a HTML entity contains only very basic characters (&, ;, a-z and 0-9) and those characters have the same binary encoding in most character sets, this is and was very safe from those side-effects.
However when you store something in the database, you don't have these issues because you're normally in control and you know what and how you can store text into the database.
For example, if you allow Unicode for text inside the database, you can store all characters, none is actually special. Note that you need to know your database here, there are some technical details you can run into. Like you don't know the charset encoding for your database connection so you can't exactly tell your database which text you want to store in there. But generally, you just store the text and retrieve it later. Nothing special to deal with.
In fact there are downsides when you use HTML entities instead of the plain character:
HTML entities consume more space: &uuml; is much larger than ü in LATIN-1, UTF-8, UTF-16 or UTF-32.
HTML entities need further processing. They need to be created, and when read, they need to be parsed. Imagine you need to search for a specific text in your database, or any other action would need additional handling. That's just overhead.
The real fun starts when you mix both concepts. You come to a place you really don't want to go into. So just don't do it because you ain't gonna need it.
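The two downsides can be seen in a few lines. This is only an illustrative sketch with a made-up stored value; the characters are written with \u{...} escapes so the example does not depend on the file encoding.

```php
<?php
$char   = "\u{FC}";   // "ü" as UTF-8
$entity = '&uuml;';

var_dump(strlen($char));   // int(2): two bytes in UTF-8
var_dump(strlen($entity)); // int(6): six bytes in any ASCII-compatible encoding

// Searching: text stored with entities only matches after an extra decoding step.
$stored  = 'M&uuml;nchen';  // hypothetical value stored with an entity
$needle  = "M\u{FC}";       // what a user would actually search for
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump(strpos($stored, $needle) !== false);  // bool(false)
var_dump(strpos($decoded, $needle) !== false); // bool(true)
```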

Leave your data raw in the database. Don't use HTML entities for these until you need them for HTML. You never know when you may want to use your data elsewhere, not on a web page.

My suggestion would mirror the other contributors, don't convert the special entities when saving them to your database.
Some reasons against conversion:
K.I.S.S principle (my biggest reason not to do it)
most entities will end up consuming more space than prior to being converted
you lose the ability to search easily: to find ü inside a word you would search for [word]+ü+[/word], but against converted data you would have to compare against the html equivalent instead, ü => [word]+&uuml;+[/word].
your output may change from HTML to, say, an API for mobile, etc., which makes the conversion unnecessary.
need to convert on input of data, and on output (again if your output changes from plain HTML to something else).

How do I HTML Encode all the output in a web application?

I want to prevent XSS attacks in my web application. I found that HTML Encoding the output can really prevent XSS attacks. Now the problem is: how do I HTML encode every single output in my application? Is there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.

One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.

You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars

For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.

A nice way I used to escape all user input is by writing a modifier for smarty which escapes all variables passed to the template, except for the ones that have |unescape attached to them. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)

My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason to encode everything is that even properties which you might assume to be boolean or numeric could contain malicious code (for example, checkbox values, if they're handled improperly, could be coming back as strings; if you're not encoding them before sending the output to the user, then you've got a vulnerability).

You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)
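A wrapper along those lines might look like the sketch below. The name myecho comes from the example above; the body and the second parameter are assumptions about how it could be written.

```php
<?php
// Hypothetical wrapper: escapes by default, with an opt-out for trusted markup.
function myecho(string $text, bool $escape = true): void
{
    echo $escape ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : $text;
}

myecho('<b>blah</b>');        // prints &lt;b&gt;blah&lt;/b&gt;
echo "\n";
myecho('<b>blah</b>', false); // prints <b>blah</b>
echo "\n";
```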

If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problem accepting external password containing < etc..

The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.

OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).

Output encoding is by far the best defense. Validating input is great for many reasons, but it is not a 100% defense. If a database becomes infected with XSS via attack (e.g. ASPROX), mistake, or malice, input validation does nothing. Output encoding will still work.

There was a good essay from Joel on Software ("Making Wrong Code Look Wrong", I think; I'm on my phone, otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
var
  dsFirstName, uhsFirstName: String;
begin
  uhsFirstName := request.queryfields.value['firstname']; // unsafe: straight from the request
  dsFirstName := dsHtmlToDB(uhsFirstName);                // database-safe after conversion
end;
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everywhere. But by using these prefixes, which carry a useful meaning, you'll see real quick when looking at your code if something isn't right. And you're going to need different encode/decode functions anyway.

Are named entities in HTML still necessary in the age of Unicode aware browsers?

I did a lot of PHP programming in the last years and one thing that keeps annoying me is the weak support for Unicode and multibyte strings (to be sure, natively there is none). For example, "htmlentities" seems to be a much used function in the PHP world and I found it to be absolutely annoying when you've put an effort into keeping every string localizable, only store UTF-8 in your database, only deliver UTF-8 webpages etc. Suddenly, somewhere between your database and the browser there's this hopelessly naive function pretending every byte is a character and messes everything up.
I would just love to dump these kinds of functions, they seem totally superfluous. Is it still necessary these days to write '&auml;' instead of 'ä'? At least my Firefox seems perfectly happy to display even the strangest Asian glyphs as long as they're served in a proper encoding.
Update: To be more precise: Are named entities necessary for anything else than displaying HTML tags (as in "&lt;" for "<")
Update 2:
#Konrad: Are you saying that, no, named entities are not needed?
#Ross: But wouldn't it be better to sanitize user input when it's entered, to keep my output logic free from such issues? (assuming of course, that reliable sanitizing on input is possible - but then, if it isn't, can it be on output?)

Named entities in "real" XHTML (i.e. with application/xhtml+xml, rather than the more frequently-used text/html compatibility mode) are discouraged. Aside from the five defined in XML itself (&lt;, &gt;, &amp;, &quot;, &apos;), they'd all have to be defined in the DTD of the particular DocType you're using. That means your browser has to explicitly support that DocType, which is far from a given. Numbered entities, on the other hand, obviously only require a lookup table to get the right Unicode character.
As for whether you need entities at all these days: you can pretty much expect any modern browser to support UTF-8. Therefore, as long as you can guarantee that the database, the markup and the web server all agree to serve that, ditch the entities.
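If you do decide to ditch the entities, a sketch of the PHP side could look like this: decode whatever entities are already in the text into real UTF-8 characters, then escape only the characters that are special in HTML. The sample string is made up.

```php
<?php
$legacy = 'K&auml;se &amp; Br&ouml;tchen &lt;frisch&gt;'; // hypothetical text stored with named entities

// Turn every entity into its character...
$text = html_entity_decode($legacy, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// ...then escape only the characters that are special in HTML.
$html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $html, "\n"; // Käse &amp; Brötchen &lt;frisch&gt;
```

Unlike htmlentities, htmlspecialchars leaves ä and ö as they are, so the page must actually be served as UTF-8.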

If using XHTML, it's actually recommended not to use named entities ([citation needed]). Some browsers (Firefox …), when parsing this as XML (which they normally don't), don't read the DTD files and thus are unable to handle the entities.
As it's best practice anyway to use UTF-8 as the encoding if there are no compelling reasons to do otherwise, this only means that the creator of the documents needs a decent editor that can not only handle the documents but also provides a good way of entering the diverse glyphs. OS X doesn't really have this problem because most needed glyphs can be reached via “alt” keys but Windows doesn't have this feature.
#Konrad: Are you saying that, no, named entities are not needed?
Precisely. Unless, of course, there are silly restrictions, e.g. legacy database drivers that choke on UTF-8 etc.

Safari seems to have issues with some glyphs but not others. Entities may not strictly be needed, but it's probably best to use them. Of course, this is my opinion and not backed up by anything but my own observations.
