Displaying user input as foreign languages


I just recently encountered a problem that I wasn't really expecting. Up to this point I have used htmlentities() to protect against XSS attacks. I don't give users the option to enter HTML, so it works perfectly, except when foreign users want to use the system and htmlentities() displays their letters as gibberish.
I've used HTML purifier classes alongside WYSIWYG editors to protect against XSS, but that is a full class and it does affect performance. I also don't really have to worry about allowing some HTML while blocking the rest, because HTML isn't the problem, it's just the special characters. If I have to I'll use one, but I was wondering if there is a built-in PHP function that escapes HTML characters but still displays foreign languages. I'd imagine there has to be, because PHP isn't only used by English-speaking people.
Thanks!
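A minimal sketch of what such a built-in call can look like, assuming the page is served as UTF-8 and the input really is UTF-8 (the sample string is made up). htmlspecialchars() only touches the characters that are special in HTML, so non-ASCII letters pass through unchanged. Before PHP 5.4 the default charset of htmlentities() and htmlspecialchars() was ISO-8859-1, which is one plausible cause of the gibberish, so the charset is passed explicitly here:

```php
<?php
// Assumption: $input is a UTF-8 string and the page declares UTF-8.
$input = 'Привет <b>мир</b> & café';

// Escape only & < > " ' and leave every other character alone.
$safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

The tags come out as visible text instead of markup, while the Cyrillic and accented letters are left untouched.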

Related

Confused with html encoding

I am getting confused with character encoding.
I understand people do things differently, but many suggest you should store your input in the database as it is entered, then deal with it when you are reading it in accordance with what you are planning to do with it. This makes sense to me.
So, if a user enters an apostrophe, double quote or ampersand, less than, greater than sign, these will be written in my database as ' " & < > respectively.
Now, reading the data with php, I am running the text through HTMLPurify to catch any injection issues.
Should I also htmlencode it? If I don't, it all appears OK (in Chrome and Firefox) but I am not sure if this is correct and will it display properly in other browsers?
If I use htmlentities with ENT_QUOTES, and htmlspecialchars, I start getting the codes coming through for these characters, which I believe is what I should see if looking at the page source, but not on the page the user sees.
The problem is, without doing the encoding, I am seeing what I want to see, but have this niggle in my mind, that I am not doing it correctly!

You have this confused. Character encoding is an attribute of YOUR systems. Your websites and your database are responsible for character encoding.
You have to decide what you will accept. I would say in general, the web has moved towards standardization on UTF-8. So if your websites that accept user input AND your database, and all connections involved are UTF-8, then you are in a position to accept input as UTF-8, and your character set and collation in the database should be configured appropriately.
At this point all your web pages should be HTML5, so the recommended HEAD section of your pages should at a minimum be this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
Next you have SQL injection. You specified PHP. If you are using mysqli or PDO (which is in my experience the better choice) AND you are binding all your variables as parameters (bindParam() or bindValue() in PDO, bind_param() in mysqli), there is NO ISSUE with SQL injection. That issue goes away, and the need for escaping input goes away, because the data is sent separately from the SQL statement and can no longer be confused with it.
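As a rough sketch of what binding looks like with PDO: the in-memory SQLite database and the comments table are assumptions made only so the example runs on its own (it needs the pdo_sqlite extension); with MySQL only the connection string would differ.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real one.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)');

// Input that would break a query built by string concatenation.
$input = "O'Brien'); DROP TABLE comments; --";

// The value is bound as data and never becomes part of the SQL text.
$stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
$stmt->bindValue(':body', $input, PDO::PARAM_STR);
$stmt->execute();

// The stored value is exactly what the user typed, unescaped.
$stored = $pdo->query('SELECT body FROM comments')->fetchColumn();
```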
Finally, you mentioned htmlpurifier. That exists so that people can try and avoid XSS and other exploits of that nature, that occur when you accept user input, and those people inject html & js.
That is always going to be a concern, depending on the nature of the system and what you do with that output, but as others suggested in comments, you can run sanitizers and filters on the output after you've retrieved it from the database. Sitting inside a php string variable there is no intrinsic danger, until you weaponize it by injecting it into a live html page you are serving.
In terms of finding bad actors and people trying to mess with your system, you are obviously much better off having stored the original input as submitted. Then as you come to understand the nature of these exploits, you can search through your database looking for specific things, which you won't be able to do if you sanitize first and store the result.

Sanitising form content in an internationalised scenario

In the good old days when I was a web developer (using PHP), I used to run all submitted form data through a regex before commencing any processing. For most cases, I would allow alphanumerics along with a small set of punctuation characters which would satisfy 99% of people 99% of the time whilst providing a defense against SQL injection and cross site scripting (yes I used PDO prepared statements as well).
More recently I've had to deal with input in an internationalised context, specifically, where the input can be in quite a few different western and eastern European languages as well as Arabic. In these cases, I resorted to removing potentially dangerous characters and letting everything else in. The application had a very small number of users (less than 10) and was only deployed on their internal network so I wasn't overly concerned about the security of the system but I wouldn't be comfortable taking this approach on a publicly accessible website.
In summary, I would like the input to be filtered so that what is left, is "plain text" but I'm not sure how to define the concept of plain text in an internationalised context. Are there any PHP libraries that address this?

Everything is "plain text". Even "' DROP TABLE users --" is plain text. Even "<script>" is just plain text.
What you're worried about are "special characters", i.e. plain text which has special meanings in certain contexts. For that, you need to escape these special characters to "defuse" them in the given context. For HTML, escape them to HTML entities. For SQL, SQL-escape the string (or use prepared statements to avoid this problem in general). For CSV, CSV-escape the values... You get the idea. There are always functions or libraries available which will do this for you, so don't try to reinvent the wheel here.
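To illustrate, here is a small sketch that takes one and the same string and escapes it for three different contexts using built-in functions. The sample value and the search.php URL are hypothetical:

```php
<?php
$value = 'Tom & "Jerry" <tj@example.com>';

// HTML context: special characters become entities.
$html = '<p>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</p>';

// URL context: everything outside the unreserved set is percent-encoded.
$url = 'search.php?q=' . rawurlencode($value);

// CSV context: fputcsv() adds the enclosure and doubles embedded quotes.
$fh = fopen('php://memory', 'w+');
fputcsv($fh, [$value, 'second field'], ',', '"', '\\');
rewind($fh);
$csv = stream_get_contents($fh);
fclose($fh);

echo $html, "\n", $url, "\n", $csv;
```

The original string is never modified; each context gets its own escaped copy at the moment it is needed.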
If you want to sanitize, i.e. remove content, you need to define better what you want to remove. Removing content also always runs the risk of removing legitimate content your users may want to use. So it's usually the annoying option.
For more on this topic, see The Great Escapism (Or: What You Need To Know To Work With Text Within Text).

Give strip_tags() a try: http://php.net/manual/en/function.strip-tags.php. It has worked for me in most English cases and might work for other languages.
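For comparison, a quick sketch of how strip_tags() and escaping treat text that merely contains a "<" (the sample string is made up). strip_tags() does not validate HTML, so anything that looks like the start of a tag can take legitimate text with it:

```php
<?php
$input = 'I <3 PHP';

// strip_tags() treats "<3 PHP" as an unfinished tag and drops it.
$stripped = strip_tags($input);

// Escaping keeps every character and only makes it safe to display.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump($stripped, $escaped);
```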

How to prevent XSS attack with Zend Form using %

Our company has made a website for our client. The client hired a web security company to test the pages for security before the product launches.
We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements.
They ran another test and it still failed :(
They used the following for one input field in the data of the HTTP request: name=%3Cscript%3Ealert%28123%29%3C%2Fscript%3E which basically translates to name=<script>alert(123)</script>
I've added alpha and alnum validators to some of the fields, which fixes the XSS vulnerability (touch wood) by removing the %. However, now the boss doesn't like it, because what about O'Brien and double-barrelled surnames...
I haven't come across the %3C as < problem reading up about XSS. Is there something wrong with my html character set or encoding or something?
I probably now have to write a custom filter, but that would be a huge pain to do that with every website and deployment. Please help, this is really frustrating.
EDIT:
if it's about escaping the form's output, how do I do that? The form submits to the same page - how do I escape if I only have in my view <?= $this->form ?>
How can I get Zend Form to escape its output?

%3Cscript%3Ealert%28123%29%3C%2Fscript%3E is the URL-encoded form of <script>alert(123)</script>. Any time you include < in a form value, it will be submitted to the server as %3C. PHP will read and decode that back to < before anything in your application gets a look at it.
That is to say, there is no special encoding that you have to handle; you won't actually see %3C in your input, you see <. If you're failing to encode that for on-page display then you don't have even the most basic defenses against XSS.
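A small sketch of that decoding step, using parse_str() to stand in for what PHP does when it fills $_POST or $_GET (the variable names are only for illustration):

```php
<?php
// The request data exactly as the security testers sent it.
$rawBody = 'name=%3Cscript%3Ealert%28123%29%3C%2Fscript%3E';

// parse_str() URL-decodes the values, much like PHP does for $_POST.
parse_str($rawBody, $fields);
$name = $fields['name'];

// The application sees a plain "<", so this is what must be escaped on output.
$escaped = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo $name, "\n", $escaped, "\n";
```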
"We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements."
I'm afraid you have not fixed your XSS problems at all. You may have merely obfuscated them.
Input filtering is a depressingly common but quite wrong strategy for blocking XSS.
It is not the input that's the problem. As your boss says, there is no reason you shouldn't be able to input O'Brien. Or even <script>, like I am just now in this comment box. You should not attempt to strip tags in the input or even HTML-encode them, because who knows at input-time that the data is going to end up in an HTML page? You don't want your database filled with nonsense like 'Fish&amp;Chips' which then ends up in an e-mail or other non-HTML context with weird HTML escapes in it.
HTML-encoding is an output-stage issue. Leave the incoming strings alone, keep them as raw strings in the database (of course, if you are hacking together queries in strings to put the data in the database instead of parameterised queries, you would need to SQL-escape the content at exactly that point). Then only when you are inserting the values in HTML, encode them:
Name: <?php echo htmlspecialchars($row['name']); ?>
If you have a load of dodgy code like echo "Name: $name"; then I'm afraid you have much rewriting to do to make it secure.
Hint: consider defining a function with a short name like h so you don't have to type htmlspecialchars so much. Don't use htmlentities which will usually-unnecessarily encode non-ASCII characters, which will also mess them up unless you supply a correct $charset argument.
(Or, if you are using Zend_View, $this->escape().)
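The short helper mentioned in the hint could look something like this. It is only a sketch: the flags and the explicit UTF-8 charset are assumptions about a typical setup, and $row is sample data.

```php
<?php
// Short alias for output escaping in templates.
function h(string $text): string
{
    // ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces
    // invalid byte sequences instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Sample row as it might come back from the database, stored raw.
$row = ['name' => "O'Brien <script>alert(123)</script>"];

echo 'Name: ', h($row['name']), "\n";
```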
Input validation is useful on an application-specific level, for things like ensuring telephone number fields contain numbers and not letters. It is not something you can apply globally to avoid having to think about the issues that arise when you put a string inside the context of another string—whether that's inside HTML, SQL, JavaScript string literals or one of the many other contexts that require escaping.
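As an illustration of application-specific validation, here are two hypothetical validators. The exact rules (allowed characters, length limits) are assumptions and would need tuning for real data; note that the surname check accepts O'Brien and double-barrelled names, and that neither function replaces escaping on output.

```php
<?php
// Hypothetical rule: optional "+", then digits with common separators.
function is_valid_phone(string $phone): bool
{
    return preg_match('/^\+?[0-9][0-9 ()-]{5,19}\z/', $phone) === 1;
}

// Hypothetical rule: Unicode letters plus apostrophe, space and hyphen.
function is_valid_surname(string $name): bool
{
    return preg_match('/^\p{L}[\p{L}\p{M}\' -]*\z/u', $name) === 1;
}

var_dump(is_valid_surname("O'Brien"), is_valid_phone('call me'));
```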

If you correctly escape strings every time you write them to the HTML page, you won't have any issues.
%3C is a URL-encoded <; it is decoded by the server.

HTML Purifier - what to purify?

I am using HTML Purifier to protect my application from XSS attacks. Currently I am purifying content from WYSIWYG editors because that is the only place where users are allowed to use XHTML markup.
My question is, should I use HTML Purifier also on username and password in a login authentication system (or on input fields of sign up page such as email, name, address etc)? Is there a chance of XSS attack there?

You should Purify anything that will ever possibly be displayed on a page. Because with XSS attacks, hackers put in <script> tags or other malicious tags that can link to other sites.
Passwords and emails should be fine. Passwords should never be shown and emails should have their own validator to make sure that they are in the proper format.
Finally, always remember to run htmlentities() on content when you output it.
Oh, and look at filter_var() as well. It's a very nice way of filtering variables.

XSS risks exist wherever data entered by one user may be viewed by other users. Even if this data isn't currently viewable, don't assume that a need to do this won't arise.
As far as the username and password go, you should never display a password, or even store it in a form that can be displayed (i.e. hash it; a bare sha1() is no longer considered adequate for passwords, and PHP's built-in password_hash() is meant for exactly this). For usernames, have a restriction on legal characters like [A-Za-z0-9_]. Finally, as the other answer suggests, use your language's HTML entity encoding function for any entered data that may contain reserved or special HTML characters, which prevents this data from being interpreted as markup when displayed.
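A minimal sketch of both points, using PHP's built-in password_hash() and password_verify() for storage and a whitelist pattern for usernames. The sample password and the 3 to 30 character length limit are assumptions for the example:

```php
<?php
// Store only a salted hash; the plain password is never kept or displayed.
$hash = password_hash('correct horse battery staple', PASSWORD_DEFAULT);

// At login, compare the submitted password against the stored hash.
$ok  = password_verify('correct horse battery staple', $hash);
$bad = password_verify('wrong guess', $hash);

// Usernames restricted to [A-Za-z0-9_]; the length limit is an assumption.
function is_valid_username(string $username): bool
{
    return preg_match('/\A[A-Za-z0-9_]{3,30}\z/', $username) === 1;
}

var_dump($ok, $bad, is_valid_username('jane_doe'));
```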

No, I wouldn't use HTMLPurifier on username and password during login authentication. In my applications I use alphanumeric usernames and an input validation filter and display them with htmlspecialchars with ENT_QUOTES. This is very effective and a hell of a lot faster than HTMLPurifier. I'm yet to see an XSS attack using an alphanumeric string. And BTW HTMLPurifier is useless when filtering alphanumeric content anyway, so if you force the input string through an alphanumeric filter then there is no point in displaying it with HTMLPurifier. When it comes to passwords, they should never be displayed to anybody in the first place, which eliminates the possibility of XSS. And if for some perverse reason you want to display the passwords, then you should design your application in such a way that only the owner of the password is able to see it, otherwise you are screwed big time and XSS is the least of your worries!

HTML Purifier takes HTML as input, and produces HTML as output. Its purpose is to allow the user to enter html with some tags, attributes, and values, while filtering out others. This uses a whitelist to prevent any data that can contain scripts. So this is useful for something like a WYSIWYG editor.
Usernames and passwords on the other hand are not HTML. They're plain text, so HTML purifier is not an option. Trying to use HTML Purifier here would either corrupt the data, or allow XSS attacks.
For example, it lets the following through unchanged, which can cause XSS issues when inserted as an attribute value in some elements:
" onclick="javascript:alert()" href="
Or if someone tried to use special symbols in their password, and entered:
<password
then their password would become blank, and make it much easier to guess.
Instead, you should encode the text. The encoding required depends on the context, but you can use htmlentities when outputting these values if you stick to rule #0 and rule #1, at the OWASP XSS Prevention Cheat Sheet
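To make the attribute case concrete, here is a sketch that encodes the payload shown above before placing it inside a double-quoted attribute. The surrounding <a title="..."> markup is just an example:

```php
<?php
// The string that would break out of an attribute if inserted as-is.
$payload = '" onclick="javascript:alert()" href="';

// ENT_QUOTES encodes both quote styles, so the value cannot end the attribute.
$attr = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');

// Always quote the attribute; the encoding does not protect unquoted ones.
$html = '<a title="' . $attr . '">profile</a>';

echo $html, "\n";
```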

Users entering ampersand & character messing up my sites w3c validation

My social networking site is W3C XHTML valid. However, users are able to post blog reports and so on, and at times they enter ampersand characters, which in turn mess up my validation. How can I fix this, and are there any other single characters that I need to look out for that could mess up my validation?

When displaying user produced content, run it through the htmlspecialchars() function.

As a matter of general principle it's a mistake to include user-submitted (or indeed any external) content into your page directly without validation or filtering. Besides causing validation errors it can also cause "broken pages" and large security holes (cross-site scripting attacks).
Whenever you get data from anywhere that isn't 100% trusted, you need to make it safe in some way. You can do this by doing some or all of:
1. Escaping textual data so that special characters are replaced by the HTML entities that represent them.
2. Stripping or filtering unsafe HTML tags.
3. Validating that HTML doesn't contain any unsafe or illegal constructs.
If your user input is meant to be interpreted as text then you're mostly looking at option 1; if you're letting the users use HTML then you're looking at options 2 and 3. A fourth option is to have the users use some more restrictive non-HTML markup such as Markdown or bbCode, translating between that markup and HTML using a library that (hopefully) doesn't allow the injection of security holes, page-breaking constructs, or other scary things.

It's a bad idea to allow users to enter HTML markup.
This enables all kinds of nasty things, most notably cross-site scripting (XSS) exploits and injection of hidden spam (hidden from you, not search engine bots).
You should:
Obliterate all HTML tags using htmlspecialchars() and only preserve newlines with nl2br(). You might allow some formatting by implementing your own safe markup that allows only very specific tags (things like phpBB or Wiki-like markup).
Use HTML Purifier to reliably eliminate all potentially dangerous markup. PHP's strip_tags() function is fundamentally broken and allows dangerous code in attributes if you use the whitelist argument.
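The first option, escaping everything and keeping only the line breaks, might be sketched like this. The function name render_comment is made up, and UTF-8 is assumed:

```php
<?php
// Escape first, then turn newlines into <br /> tags.
// The order matters: nl2br() output must not be escaped afterwards.
function render_comment(string $text): string
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

$out = render_comment("Fish & Chips\n<script>alert(1)</script>");

echo $out, "\n";
```

This also takes care of stray ampersands, so user content no longer breaks markup validation.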
