I have a website that tells the output is UTF-8, but I never make sure that it is. Should I use a regular expression or Iconv library to convert UTF-8 to UTF-8 (leaving invalid sequences)? Is this a security issue if I do not do it?
First of all I would never just blindly encode it as UTF-8 (possibly) a second time because this would lead to invalid chars as you say. I would certainly try to detect if the charset of the content is not UTF-8 before attempting such a thing.
Secondly if the content in question comes from a source wich you have control over and control the charset for such as a file with UTF-8 or a database with UTF-8 in use in the tables and on the connection, I would trust that source unless something gives me hints that I can't and there is something funky going on. If the content is coming from more or less random places outside your control, well all the more reason to inspect it and possibly try to re-encode og transform from other charsets if you can detect it. So the bottom line is: It depends.
As to wether this is a security issue or not I wouldn't think so (at least I can't think of any scenarios where this could be exploitable) but I'll leave to others to be definitive about that.
Not a security issue, but your users (especially non-english speaking) will be very annoyed, if you send invalid UTF-8 byte streams.
In the best case (what most browsers do) all invalid strings just disappear or show up as gibberish. The worst case is that the browser quits interpreting your page and says something like "invalid encoding". That is what, e.g., some text editors (namely gedit) on Linux do.
OK, to keep it realistic: If you have an english-centered website without heavily relying on some maths characters or Unicode arrows, it will almost make no difference. But if you serve, e.g., a Chinese site, you can totally screw it up.
Cheers,
Everybody gets charsets messed up, so generally you can't trust any outside source. It's a good practise to verify that the provided input is indeed valid for the charset that it claims to use. Luckily, with UTF-8, you can make a fairly safe assertion about the validity.
If it's possible for users to send in arbitrary bytes, then yes, there are security implications of not ensuring valid utf8 output. Depending on how you're storing data, though, there are also security implications of not ensuring valid utf8 data on input (e.g., it's possible to create a variant of this SQL injection attack that works with utf8 input if the utf8 is allowed to be invalid utf8), so you really should be using iconv to convert utf8 to utf8 on input, and just avoid the whole issue of validating utf8 on output.
The two main security reason you want to check that the output is valid utf-8 is to avoid "overlong" byte sequences - that is, cases of byte sequences that mean some character like '<' but are encoded in multiple bytes - and to avoid invalid byte sequences. The overlong encoding issue is obvious - if your filter changes '<' into '<', it might not convert a sequence that means '<' but is written differently. Note that all current-generation browsers will mark overlong sequences as invalid, but some people may be using old browsers.
The issue with invalid sequences is that some utf-8 parsers will allow an invalid sequence to eat some number of valid bytes that follow the invalid ones. Again, not an issue if everyone always has a current browser, but...
Related
I need to determine the character encoding of the contents of a .csv file.
Every snippet that I have seen do this uses file_get_contents(), however I can't use that because the file is too large to store in a variable (server memory limit exhausted).
How can I determine the character encoding of a file? Can I just get the first x characters and check them? Would that guarantee that my whole file is that encoding?
Alternatively, can I simply convert the entire csv to UTF-8 without knowing the current file encoding?
No, you can't determine the encoding with just the first x characters. You can guess it, and the guess may be wrong. The file may be UTF-8 but not contain UTF-8 before x characters. If may contain another encoding that is compatible with ASCII, bot only after character x.
No, you can't convert a file without knowing the current file encoding.
You can go straight to the conversion, as you said, using iconv (http://php.net/manual/en/function.iconv.php#49434)
'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, 1864.
You have missing metadata and are proposing to put in values whether they are right or not.
Only the author/sender can tell you, perhaps via some standard, specification, convention, agreement or communication. A common method of communication when transferring data via HTTP is the Content-Type header.
Unfortunately, inadequate communication of metadata for text files and streams is too common in our industry. It stems from the 1970s and 80s when text files were converted to the local character encoding upon receipt. That doesn't apply anymore and nothing really took its place.
Non-answer:
Conversion from ISO-8859-1 will never fail during conversion because it uses all 256 bytes values in any sequence.
Conversion to any current Unicode encoding (including UTF-8) will never fail because all of them support the whole Unicode character set, and Unicode includes every computerized character you are likely to see today.
But wait, there is more needed metadata in the case of CSV:
line ending (arguably detectable)
field separator (arguably detectable)
quoting scheme, including escaping
presence of header row
and, finally, the datatype of each column.
And, keep in mind, if you were to guess any of this, and the data source is updatable, today's guess might not work tomorrow.
From this excellent "UTF-8 all the way through" question, I read about this:
Unfortunately, you should verify every submitted string as being valid
UTF-8 before you try to store it or use it anywhere. PHP's
mb_check_encoding() does the trick, but you have to use it
religiously. There's really no way around this, as malicious clients
can submit data in whatever encoding they want, and I haven't found a
trick to get PHP to do this for you reliably.
Now, I'm still learning the quirks of encoding, and I'd like to know exactly what malicious clients can do to abuse encoding. What can one achieve? Can somebody give an example? Let's say I save the user input into a MySQL database, or I send it through e-mail, how can a user create harm if I do not use the mb_check_encoding functionality?
how can a user create harm if I do not use the mb_check_encoding functionality?
This is about overlong encodings.
Due to an unfortunate quirk of UTF-8 design, it is possible to make byte sequences that, if parsed with a naïve bit-packing decoder, would result in the same character as a shorter sequence of bytes - including a single ASCII character.
For example the character < is usually represented as byte 0x3C, but could also be represented using the overlong UTF-8 sequence 0xC0 0xBC (or even more redundant 3- or 4-byte sequences).
If you take this input and handle it in a Unicode-oblivious byte-based tool, then any character processing step being used in that tool may be evaded. The canonical example would be submitting 0x80 0xBC to PHP, which has native byte strings. The typical use of htmlspecialchars to HTML-encode the character < would fail here because the expected byte sequence 0x3C is not present. So the output of the script would still include the overlong-encoded <, and any browser reading that output could potentially read the sequence 0x80 0xBC 0x73 0x63 0x72 0x69 0x70 0x74 as <script and hey presto! XSS.
Overlongs have been banned since way back and modern browsers no longer permit them. But this was a genuine problem for IE and Opera for a long time, and there's no guarantee every browser is going to get it right in future. And of course this is only one example - any place where a byte-oriented tool processes Unicode strings you've potentially got similar problems. The best approach, therefore, is to remove all overlongs at the earliest input phase.
Seems like this is a complicated attack. Checking the docs for mb_check_encoding gives note to a "Invalid Encoding Attack". Googling "Invalid Encoding Attack" brings up some interesting results that I will attempt to explain.
When this kind of data is sent to the server it will perform some decoding to interpret the characters being sent over. Now the server will do some security checks to look for the encoded version of some special characters that could be potentially harmful.
When invalid encoding is sent to the server, the server still runs its decoding algorithm and it will evaluate the invalid encoding. This is where the trouble happens because the security checks may not be looking for invalid variants that would still produce harmful characters when run through the decoding algorithm.
Example of an attack requesting a full directory listing on a unix system :
http://host/cgi-bin/bad.cgi?foo=..%c0%9v../bin/ls%20-al|
Here are some links if you would like a more detailed technical explanation of what is going on in the algorithms:
http://www.cgisecurity.com/owasp/html/ch11s03.html#id2862815
http://www.cgisecurity.com/fingerprinting-port-80-attacks-a-look-into-web-server-and-web-application-attack-signatures.html
Never trust the input. But it is also true for the character encoding? Is good practice to control the encoding of the string received, to avoid unexpected errors? Some people use preg_match to check invalid string. Others make a control byte for byte to validate it. And who normalized using iconv. What is the fastest and safest way to do this check?
edit
I noticed that if I try to save a string utf-8 corrupted in my mysql database, the string will be truncated without warning. There are countermeasures for this eventuality?
Is good practice to control the encoding of the string received, to avoid unexpected errors?
No. There is no reliable way to detect the incoming data's encoding*, so the common practice is to define which encoding is expected:
If you are exposing an API of some sort, or a script that gets requests from third party sites, you will usually point out in the documentation what encoding you are expecting.
If you have forms on your site that are submitted to scripts, you will usually have a site-wide convention of which character set is used.
The possibility that broken data comes in is always there, if the declared encoding doesn't match the data's actual encoding. In that case, your application should be designed so there are no errors except that a character gets displayed the wrong way.
Looking at the encoding that the request declares the incoming data to be in like #Ignacio suggests is a very interesting idea, but I have never seen it implemented in the PHP world. That is not saying anything against it, but you were asking about common practices.
*: It is often possible to verify whether incoming data has a specific encoding. For example, UTF-8 has specific byte values that can't stand on their own, but form a multi-byte character. ISO-8859-1 special characters overlap with those values, and will therefore be detected as invalid in UTF-8. But detecting a completely unknown encoding from an arbitrary set of data is close to impossible.
Look at the charset specified in the request.
Your web publishes the webservice or produces the form and you can specify which encoding you expect. So if the input passes your validation everything is ok. If it doesn't you don't need to take care why it didn't pass. If it was due to wrong encoding it is not your fault.
I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get an unknown character set, and it has strange characters (like english quotes), is there a standard way to convert them to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason why you saw so many complicated solutions for this problem is because by definition it is not solvable. The process of encoding a string of text is non-deterministic.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) that are sent to the server. This is a valid EBCDIC string that represents the letter C and d. This is also a valid ANSI string that represents the 2 characters à and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because character encoding is non-deterministic, you cannot expect that there exist a trivial method to uncover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurance of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters à and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This roughy is the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable will its results be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
function forceToUtf8($string) {
if (!mb_check_encoding($string)) {
return false;
}
return mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));
}
If I'm not wrong, there is something called utf8encode... it works well EXCEPT if you are already in utf8
http://php.net/manual/en/function.utf8-encode.php
I'm looking at encoding strings to prevent XSS attacks. Right now we want to use a whitelist approach, where any characters outside of that whitelist will get encoded.
Right now, we're taking things like '(' and outputting '(' instead. As far as we can tell, this will prevent most XSS.
The problem is that we've got a lot of international users, and when the whole site's in japanese, encoding becomes a major bandwidth hog. Is it safe to say that any character outside of the basic ASCII set isn't a vulnerability and they don't need to be encoded, or are there characters outside the ASCII set that still need to be encoded?
Might be (a lot) easier if you just pass the encoding to htmlentities()/htmlspecialchars
echo htmlspecialchars($string, ENT_QUOTES, 'utf-8');
But if this is sufficient or not depends on what you're printing (and where).
see also:
http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
http://jimbojw.com/wiki/index.php?title=Sanitizing_user_input_against_XSS
http://www.erich-kachel.de/?p=415 (in german. If I find something similar in English -> update) edit: well, I guess you can get the main point without being fluent in german ;)
The stringjavascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41)) passes htmlentities() unchanged. Now consider something like<a href="<?php echo htmlentities($_GET['homepage']); ?>"which will send<a href="javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))">to the browser. And that boils down tohref="javascript:eval(\"alert('XSS')\")"While htmlentities() gets the job done for the contents of an element, it's not so good for attributes.
In general, yes, you can depend on anything non-ascii to be "safe", however there are some very important caveats to consider:
Always ensure that what you're
sending to the client is tagged as
UTF-8. This means having a header
that explicitly says "Content-Type:
text/html; charset=utf-8" on every
single page, including all of your
error pages if any of the content on
those error pages is generated from
user input. (Many people forget to
test their 404 page, and have that
page include the not-found URL verbatim)
Always ensure that
what you're sending to the client is
valid UTF-8. This means you
cannot simply pass through
bytes received from the user back to
the user again. You need to decode
the bytes as UTF-8, apply your html-encoding XSS prevention, and then encode
them as UTF-8 as you write them back
out.
The first of those two caveats is to keep the client's browser from seeing a bunch of stuff including high-letter characters and falling back to some local multibyte character set. That local multi-byte character set may have multiple ways of specifying harmful ascii characters that you won't have defended against. Related to this, some older versions of certain browsers - cough ie cough - were a bit overeager in detecting that a page was UTF-7; this opens up no end of XSS possibilities. To defend against this, you might want to make sure you html-encode any outgoing "+" sign; this is excessive paranoia when you're generating proper Content-Type headers, but will save you when some future person flips a switch that turns off your custom headers. (For example, by putting a poorly configured caching reverse proxy in front of your app, or by doing something to insert an extra banner header - php won't let you set any HTTP headers if any output is already written)
The second of those is because it is possible in UTF-8 to specify "overly short" sequences that, while invalid under current specs, will be interpreted by older browsers as ASCII characters. (See what wikipedia has to say) Also, it is possible that someone may insert a single bad byte into a request; if you pass this pack to the user, it can cause some browsers to replace both the bad byte and one or more bytes after it with "?" or some other "couldn't understand this" character. That is, a single bad byte could cause some good bytes to also be swallowed up. If you look closely at what you're outputting, there's probably a spot somewhere where an attacker who was able to wipe a byte or two out of the output could do some XSS. Decoding the input as UTF-8 and then re-encoding it prevents this attack vector.