After reviewing this I realised I still have a few questions left regarding the topic.
Are there any characters that should be 'left out' for legitimate security purposes? This includes all characters, such as brackets, commas, apostrophes, and parentheses.
While on this subject, I admittedly don't understand why admins seem to enjoy enforcing the "you can only use the alphabet, numbers, and spaces" rule. Does anything else have the potential to be a security flaw or break something I'm not aware of (even in ASCII)? As far as I've seen in my coding days, there is absolutely no reason any character should be barred from being in a username.
There's no security reason to not use certain characters. If you're properly handling all input, it doesn't make any difference whether you're only handling alphanumeric characters or Chinese.
It is easier to handle alphanumeric-only usernames. You don't need to think about ambiguity with collations in your database, encoding usernames in URLs, and things like that. But again, if you're handling input properly, there's no technical reason against it.
For practical reasons passwords are often only alphanumeric. Most password inputs don't accept IME input for example, so it's almost impossible to have a Japanese password. There's no security reason for disallowing non-alphanum characters though. On the contrary, the larger the usable alphabet, the better.
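As a rough back-of-the-envelope illustration of that last point (assuming characters are chosen uniformly at random), a password's entropy is its length times log2 of the alphabet size:

8 characters from a 62-symbol alphabet (a-z, A-Z, 0-9): 8 × log2(62) ≈ 47.6 bits
8 characters from, say, 1,000 permitted Unicode symbols: 8 × log2(1000) ≈ 79.7 bits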
If your application handles Unicode input properly throughout, I'd certainly allow non-ASCII characters in usernames and passwords, with a few caveats:
If you use HTTP Basic Authentication, you can't properly support non-ASCII characters in usernames and passwords, because the process of passing those details involves an encode-to-bytes-in-base64 step that, currently, browsers don't agree on:
Safari uses ISO-8859-1, and breaks if any non-8859-1 characters are present;
Mozilla uses the low byte of each UTF-16 code unit (the same as ISO-8859-1 for those characters);
Opera and Chrome use UTF-8;
IE uses the ANSI code page of the system it's installed on, which could be anything, but never ISO-8859-1 or UTF-8. Characters that don't fit that encoding are arbitrarily mangled.
If you use cookies, you must ensure any Unicode characters are encoded in some way (eg URL-encoding), as once again trying to send non-ASCII characters gives vastly different results in different browsers.
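As a minimal sketch of the cookie caveat (the cookie name and value are made up for illustration), URL-encode on the way out; PHP URL-decodes incoming cookie values itself:

// setrawcookie() sends the value byte-for-byte, so we encode it
// ourselves; rawurlencode() reduces the UTF-8 bytes to safe ASCII.
$displayName = 'José';
setrawcookie('display_name', rawurlencode($displayName));
// On the next request, $_COOKIE['display_name'] already holds the
// UTF-8 string again, because PHP URL-decodes incoming cookies.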
"you can only use the alphabet, numbers, and spaces"
You get spaces? Luxury!
It is often exactly those characters that can be used to inject malicious code into your program: for example SQL injection (quotes, dashes, etc.), XSS/CSRF (quotes, angle brackets, etc.), or even programming-language injection when eval() is used elsewhere in your code.
Those characters usually do no harm when you, as the developer, properly sanitize the user-controlled input/output, i.e. everything that comes in with the HTTP request: the headers, parameters, and body. E.g. use parameterized queries (or mysql_real_escape_string() when inlining values in a SQL query) to prevent SQL injection, and htmlspecialchars() when inlining values in HTML to prevent XSS. But I can imagine that admins don't trust all developers, so they add those restrictions.
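A minimal PHP sketch of both measures (the table and column names, and the $pdo connection, are made up for illustration):

// Parameterized query: the username travels separately from the SQL,
// so quotes, dashes, etc. cannot change the query's structure.
$stmt = $pdo->prepare('SELECT id FROM users WHERE username = ?');
$stmt->execute([$_POST['username']]);

// HTML escaping on output: <, >, &, " and ' become entities,
// so the value cannot break out of the surrounding markup.
echo htmlspecialchars($_POST['username'], ENT_QUOTES, 'UTF-8');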
See also:
OWASP on PHP top 5 vulnerabilities
I don't think there is a reason not to allow Unicode in usernames. Passwords are a different story: since you don't usually see a password while you type it into a form, allowing only ASCII makes sense to prevent possible confusion.
I think it makes sense to use the email address as the login credential rather than requiring users to create a new username. Then the user can select any nickname, using any Unicode characters, and have that nick displayed next to their posts and comments.
Isn't this how it's done on Facebook?
I think that most of the time when things (usernames or passwords) are being forced down to ASCII, it's because someone is afraid that more complex character sets will cause breakage in some unknown component. Whether this fear is justified or not is case dependent, but trying to verify that your entire stack really does Unicode correctly in all cases might be difficult. It's getting better every day, but you can still find problems with Unicode in some places.
I personally keep my usernames and passwords all ASCII, and I even try not to use too much punctuation. One reason is that some input devices (like some mobile phones) make it kind of difficult to get to some of the more esoteric characters. Another reason is that I've more than once encountered a system that had no restrictions on the password contents, but then screwed up if you actually used something other than a letter or number.
There is a risk involved if some parts of your program assume strings with different bytes are different, but other parts compare strings according to Unicode semantics and consider them the same.
For example filesystems on Mac OS X enforce uniform representation of Unicode characters, so two different filenames Ą ('A with ogonek') and A+̨ (latin A followed by 'combining ogonek') will refer to the same file.
Similarly, one can produce invalid UTF-8 byte sequences where 1-byte codepoints are encoded using multiple bytes (called overlong sequences). If you normalize or reject invalid UTF-8 input before processing it, it'll be safe, but e.g. if you use a Unicode-ignorant programming language together with a Unicode-aware database, the two will see different inputs.
So to avoid that:
You should filter UTF-8 input as early as possible. Reject invalid/overlong sequences.
When comparing Unicode strings, always convert both sides of the comparison to the same Unicode Normal Form. For usernames you might want NFKD, to reduce the number of possible homograph attacks (a sketch follows below).
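A sketch of both steps in PHP, assuming the intl extension's Normalizer class is available:

// Step 1: reject byte sequences that aren't well-formed UTF-8
// (overlong sequences are not well-formed, so they fail here too).
if (!mb_check_encoding($username, 'UTF-8')) {
    exit('Invalid input');
}

// Step 2: normalize before storing or comparing, so precomposed
// and combining-character spellings of a name collide.
$canonical = Normalizer::normalize($username, Normalizer::FORM_KD);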
Related
I recently learned that overlong encodings cause a security risk when not properly validated. From the answer in the previously mentioned post:
For example the character < is usually represented as byte 0x3C, but could also be represented using the overlong UTF-8 sequence 0xC0 0xBC (or even more redundant 3- or 4-byte sequences).
And:
If you take this input and handle it in a Unicode-oblivious byte-based tool, then any character processing step being used in that tool may be evaded.
Meaning that if I use htmlspecialchars on a string that uses overlong encoding, then the output could still contain working tags such as <script>. I also assume that you could post similarly disguised characters (like " or ;) which could also be used for SQL injections.
Perhaps it is me, but I believe that this is a security risk relatively few people take into account and even know about. I've been coding for years and am only now finding this out.
Anyway, my question is: what tools can I use to send data with overlong encodings? People who are familiar with this risk: how do you perform tests on websites? I want to POST a bunch of overlong characters to my sites, but I have no idea how to do this.
In my situation I mostly use PHP and MySQL, but what I really want to know are testing tools, so I guess the back-end situation does not matter much.
I want to POST a bunch of overlong characters to my sites, but I have no idea how to do this.
Apart from testing it with manual request tools like curl, a simple workaround for in-browser testing is to override the encoding of the form submission. Using eg Firebug/Chrome Debugger, alter the form you're testing to add the attribute:
accept-charset="iso-8859-1"
You can now type characters that, when encoded as Windows code page 1252 (which is what browsers actually use when ISO-8859-1 is specified), become the UTF-8 overlong byte sequence you want.
For example, enter cafÃ© into the form and you will get the byte sequence c a f 0xC3 0xA9, so the application will think you typed café. Enter À¼foo and the sequence 0xC0 0xBC f o o will be submitted, which could be interpreted as <foo. Note that you won't see <foo in any output page source, because modern browsers don't parse overlong UTF-8 sequences in web pages, but you might get a �foo or some other indication that something isn't right.
For more in-depth access to doctor the input and check the output of a webapp, see dedicated sec tools like Burp.
To test if your site is vulnerable, use curl to request your page via POST and send a body containing overlong UTF-8 sequences. No text editor has an "overlong UTF-8" encoding setting, so you have to construct the payload bytes yourself in a script (a sketch follows below).
http://php.net/manual/en/function.curl-setopt.php
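A sketch of such a test using PHP's curl extension; the URL and the form field name are placeholders, and the payload bytes are written out explicitly so no editor encoding is involved:

// Overlong encoding of '<' (0xC0 0xBC) followed by 'script'.
$payload = "\xC0\xBCscript";

$ch = curl_init('http://localhost/test.php'); // placeholder URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, ['comment' => $payload]); // placeholder field
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// If the raw bytes 0xC0 0xBC survive into the page output, the
// application is not validating its UTF-8 input.
var_dump(bin2hex($response));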
From this excellent "UTF-8 all the way through" question, I read about this:
Unfortunately, you should verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
Now, I'm still learning the quirks of encoding, and I'd like to know exactly what malicious clients can do to abuse encoding. What can one achieve? Can somebody give an example? Let's say I save the user input into a MySQL database, or I send it through e-mail, how can a user create harm if I do not use the mb_check_encoding functionality?
how can a user create harm if I do not use the mb_check_encoding functionality?
This is about overlong encodings.
Due to an unfortunate quirk of UTF-8 design, it is possible to make byte sequences that, if parsed with a naïve bit-packing decoder, would result in the same character as a shorter sequence of bytes - including a single ASCII character.
For example the character < is usually represented as byte 0x3C, but could also be represented using the overlong UTF-8 sequence 0xC0 0xBC (or even more redundant 3- or 4-byte sequences).
If you take this input and handle it in a Unicode-oblivious byte-based tool, then any character processing step being used in that tool may be evaded. The canonical example would be submitting 0xC0 0xBC to PHP, which has native byte strings. The typical use of htmlspecialchars to HTML-encode the character < would fail here because the expected byte sequence 0x3C is not present. So the output of the script would still include the overlong-encoded <, and any browser reading that output could potentially read the sequence 0xC0 0xBC 0x73 0x63 0x72 0x69 0x70 0x74 as <script and hey presto! XSS.
Overlongs have been banned since way back and modern browsers no longer permit them. But this was a genuine problem for IE and Opera for a long time, and there's no guarantee every browser is going to get it right in future. And of course this is only one example - any place where a byte-oriented tool processes Unicode strings you've potentially got similar problems. The best approach, therefore, is to remove all overlongs at the earliest input phase.
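In PHP terms, the earliest input phase could be a blanket check over the request (a sketch; a real application would return a proper error page):

// Reject any request parameter that is not well-formed UTF-8;
// overlong sequences are not well-formed, so they fail here too.
foreach ($_GET + $_POST as $key => $value) {
    if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
        http_response_code(400);
        exit('Malformed UTF-8 in request');
    }
}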
Seems like this is a complicated attack. Checking the docs for mb_check_encoding gives note of an "Invalid Encoding Attack". Googling "Invalid Encoding Attack" brings up some interesting results that I will attempt to explain.
When this kind of data is sent to the server it will perform some decoding to interpret the characters being sent over. Now the server will do some security checks to look for the encoded version of some special characters that could be potentially harmful.
When invalid encoding is sent to the server, the server still runs its decoding algorithm and it will evaluate the invalid encoding. This is where the trouble happens because the security checks may not be looking for invalid variants that would still produce harmful characters when run through the decoding algorithm.
Example of an attack requesting a full directory listing on a Unix system:
http://host/cgi-bin/bad.cgi?foo=..%c0%9v../bin/ls%20-al|
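To see why a request like this can slip past a byte-level security check, here is a toy PHP function that unpacks a two-byte UTF-8 sequence the naïve way, without rejecting overlong forms (illustrative only; %c0%af is the classic overlong encoding of '/' seen in real-world traversal attacks):

// Naive 2-byte UTF-8 unpacking: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
// A strict decoder would reject results below 0x80 as overlong.
function naive_decode_pair(int $b1, int $b2): int {
    return (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
}

// A filter scanning for the byte 0x2F ('/') sees neither 0xC0 nor
// 0xAF, yet the naive decoder still produces a slash.
echo chr(naive_decode_pair(0xC0, 0xAF)); // prints "/"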
Here are some links if you would like a more detailed technical explanation of what is going on in the algorithms:
http://www.cgisecurity.com/owasp/html/ch11s03.html#id2862815
http://www.cgisecurity.com/fingerprinting-port-80-attacks-a-look-into-web-server-and-web-application-attack-signatures.html
For my first php website, I'm currently allowing all characters to be used in usernames. When they are inserted into the database, I use addslashes(), and when they are retrieved from the database, I use stripslashes(), and then I use htmlentities() to cause them to display properly on the page. I haven't had any problems so far, but are there any characters that I should disallow in usernames for any reason? HTML, CSS, and PHP are the only languages I'm fluent in, and I'm concerned that in the future I will come across functions in PHP or Java or some other language that will have difficulty parsing certain characters.
A number of characters could cause problems:
Obviously there are the characters with special meaning in html and SQL, which you have dealt with.
Other possibilities are:
For searching users, regular expression characters/wildcards, eg. *, ?
If you want to give users an email address: in practice, most mail systems only accept alphanumeric characters, underscores, hyphens, and non-adjacent dots in the local part, and many systems assume emails are case-insensitive (although that is not part of the specification)
If you want to give users a profile page where the URL contains their username, many characters will need to be encoded properly (a sketch follows this list)
Non-ASCII characters could cause problems, depending on how usernames are stored (if they are stored in fields supporting UTF-8, then any character is supported)
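For the profile-URL point, the encoding itself is a one-liner in PHP; the path and the sample username here are made up:

// Build a profile link for a username that may contain spaces,
// slashes, or non-ASCII characters.
$username = 'José/Smith ?';
$url = '/users/' . rawurlencode($username); // /users/Jos%C3%A9%2FSmith%20%3F
echo '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">profile</a>';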
I have a website that tells the output is UTF-8, but I never make sure that it is. Should I use a regular expression or Iconv library to convert UTF-8 to UTF-8 (leaving invalid sequences)? Is this a security issue if I do not do it?
First of all, I would never just blindly encode content as UTF-8 a (possibly) second time, because this would lead to invalid characters, as you say. I would certainly try to detect whether the content's charset is already UTF-8 before attempting such a thing.
Secondly, if the content in question comes from a source which you control, and whose charset you control, such as a file stored as UTF-8 or a database with UTF-8 in use in the tables and on the connection, I would trust that source unless something gives me hints that I can't and there is something funky going on. If the content is coming from more or less random places outside your control, well, all the more reason to inspect it and possibly try to re-encode or transform it from other charsets if you can detect them. So the bottom line is: it depends.
As to whether this is a security issue or not, I wouldn't think so (at least I can't think of any scenarios where this could be exploitable), but I'll leave it to others to be definitive about that.
Not a security issue, but your users (especially non-English speakers) will be very annoyed if you send invalid UTF-8 byte streams.
In the best case (what most browsers do) all invalid strings just disappear or show up as gibberish. The worst case is that the browser quits interpreting your page and says something like "invalid encoding". That is what, e.g., some text editors (namely gedit) on Linux do.
OK, to keep it realistic: If you have an english-centered website without heavily relying on some maths characters or Unicode arrows, it will almost make no difference. But if you serve, e.g., a Chinese site, you can totally screw it up.
Everybody gets charsets messed up, so generally you can't trust any outside source. It's good practice to verify that the provided input is indeed valid for the charset it claims to use. Luckily, with UTF-8, you can make a fairly safe assertion about validity.
If it's possible for users to send in arbitrary bytes, then yes, there are security implications of not ensuring valid utf8 output. Depending on how you're storing data, though, there are also security implications of not ensuring valid utf8 data on input (e.g., it's possible to create a variant of this SQL injection attack that works with utf8 input if the utf8 is allowed to be invalid utf8), so you really should be using iconv to convert utf8 to utf8 on input, and just avoid the whole issue of validating utf8 on output.
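The iconv call referred to above is a one-liner; //IGNORE drops any bytes that do not form valid UTF-8 ($input here stands for some raw request string; a sketch, not a complete input layer):

// Converting UTF-8 to UTF-8 sounds like a no-op, but it forces
// validation: invalid and overlong sequences are stripped out.
$clean = iconv('UTF-8', 'UTF-8//IGNORE', $input);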
The two main security reasons you want to check that the output is valid UTF-8 are to avoid "overlong" byte sequences (that is, byte sequences that mean some character like '<' but are encoded in multiple bytes) and to avoid invalid byte sequences. The overlong encoding issue is obvious: if your filter changes '<' into '&lt;', it might not convert a sequence that means '<' but is written differently. Note that all current-generation browsers will mark overlong sequences as invalid, but some people may be using old browsers.
The issue with invalid sequences is that some utf-8 parsers will allow an invalid sequence to eat some number of valid bytes that follow the invalid ones. Again, not an issue if everyone always has a current browser, but...
I'm looking at encoding strings to prevent XSS attacks. Right now we want to use a whitelist approach, where any characters outside of that whitelist will get encoded.
Right now, we're taking things like '(' and outputting '&#40;' instead. As far as we can tell, this will prevent most XSS.
The problem is that we've got a lot of international users, and when the whole site's in japanese, encoding becomes a major bandwidth hog. Is it safe to say that any character outside of the basic ASCII set isn't a vulnerability and they don't need to be encoded, or are there characters outside the ASCII set that still need to be encoded?
Might be (a lot) easier if you just pass the encoding to htmlentities()/htmlspecialchars():
echo htmlspecialchars($string, ENT_QUOTES, 'utf-8');
But if this is sufficient or not depends on what you're printing (and where).
see also:
http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
http://jimbojw.com/wiki/index.php?title=Sanitizing_user_input_against_XSS
http://www.erich-kachel.de/?p=415 (in German; if I find something similar in English I'll update this.) Edit: well, I guess you can get the main point without being fluent in German ;)
The string javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41)) passes htmlentities() unchanged. Now consider something like
<a href="<?php echo htmlentities($_GET['homepage']); ?>">
which will send
<a href="javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))">
to the browser. And that boils down to href="javascript:eval(\"alert('XSS')\")". While htmlentities() gets the job done for the contents of an element, it's not so good for attributes.
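Escaping alone can't help in that case, because javascript: contains no characters that htmlentities() touches; the URL itself has to be validated. A hedged sketch (the parameter name is taken from the example above):

// Allow only http(s) URLs into href; anything else gets a safe default.
$homepage = isset($_GET['homepage']) ? $_GET['homepage'] : '';
if (!preg_match('#^https?://#i', $homepage)) {
    $homepage = '#'; // rejects javascript:, data:, etc.
}
echo '<a href="' . htmlspecialchars($homepage, ENT_QUOTES, 'UTF-8') . '">home</a>';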
In general, yes, you can depend on anything non-ASCII to be "safe"; however, there are some very important caveats to consider:
Always ensure that what you're sending to the client is tagged as UTF-8. This means having a header that explicitly says "Content-Type: text/html; charset=utf-8" on every single page, including all of your error pages if any of the content on those error pages is generated from user input. (Many people forget to test their 404 page, and have that page include the not-found URL verbatim.)
Always ensure that what you're sending to the client is valid UTF-8. This means you cannot simply pass bytes received from the user back to the user again. You need to decode the bytes as UTF-8, apply your HTML-encoding XSS prevention, and then encode them as UTF-8 as you write them back out.
The first of those two caveats is to keep the client's browser from seeing a bunch of stuff including high-letter characters and falling back to some local multibyte character set. That local multi-byte character set may have multiple ways of specifying harmful ascii characters that you won't have defended against. Related to this, some older versions of certain browsers - cough ie cough - were a bit overeager in detecting that a page was UTF-7; this opens up no end of XSS possibilities. To defend against this, you might want to make sure you html-encode any outgoing "+" sign; this is excessive paranoia when you're generating proper Content-Type headers, but will save you when some future person flips a switch that turns off your custom headers. (For example, by putting a poorly configured caching reverse proxy in front of your app, or by doing something to insert an extra banner header - php won't let you set any HTTP headers if any output is already written)
The second of those is because it is possible in UTF-8 to specify "overlong" sequences that, while invalid under current specs, will be interpreted by older browsers as ASCII characters. (See what Wikipedia has to say.) Also, it is possible that someone may insert a single bad byte into a request; if you pass this back to the user, it can cause some browsers to replace both the bad byte and one or more bytes after it with "?" or some other "couldn't understand this" character. That is, a single bad byte could cause some good bytes to also be swallowed up. If you look closely at what you're outputting, there's probably a spot somewhere where an attacker who was able to wipe a byte or two out of the output could do some XSS. Decoding the input as UTF-8 and then re-encoding it prevents this attack vector.
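Putting both caveats together in PHP (a sketch; $_GET['name'] and the page structure are illustrative, and a framework would normally handle the header):

// Caveat 1: declare the encoding explicitly on every response.
header('Content-Type: text/html; charset=utf-8');

// Caveat 2: never echo raw request bytes. Forcing them through a
// UTF-8 decode/re-encode replaces invalid or overlong sequences
// before the HTML escaping is applied.
$name = isset($_GET['name']) ? $_GET['name'] : '';
$name = mb_convert_encoding($name, 'UTF-8', 'UTF-8'); // invalid bytes become '?'
echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8');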