Strange behaviour of mb_detect_order() in PHP

I would like to detect the encoding of some text (using PHP).
For that purpose I use the mb_detect_encoding() function.
The problem is that the function returns different results if I change the order of possible encodings with the mb_detect_order() function.
Consider the following example:
$html = <<< STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'UTF-8'
However, if you change the order of encodings in mb_detect_order(), the result is different:
mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'EUC-JP'
So my questions are:
Why is that happening?
Is there a way in PHP to correctly and unambiguously detect the encoding of a text?

That's what I would expect to happen.
The detection algorithm probably just keeps trying, in order, the encodings you specified in mb_detect_order and then returns the first one under which the bytestream would be valid.
Something more intelligent requires statistical methods (I think machine learning is commonly used).
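As a rough sketch (my own, not mbstring's actual implementation) of that "first valid match wins" loop, with a made-up helper name:
function naiveDetect(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding() only says "these bytes form a legal sequence
        // in this encoding"; it knows nothing about meaning.
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null; // nothing validated
}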
EDIT: See e.g. this article for more intelligent methods.
Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).

Not really. The different encodings often have large areas of overlap, and if the string you are testing falls entirely inside that overlap, then both encodings are acceptable.
For example, UTF-8 and ISO-8859-1 are the same for the letters a-z. The string "hello" would have an identical sequence of bytes in both encodings.
This is exactly why there is an mb_detect_order() function in the first place, as it allows you to say what you would prefer to happen when these clashes occur. Would you like "hello" to be UTF-8 or ISO-8859-1?
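A minimal sketch of that ambiguity, with the expected results in comments:
$bytes = 'hello'; // identical byte sequence in UTF-8 and ISO-8859-1
echo mb_detect_encoding($bytes, array('UTF-8', 'ISO-8859-1'), true); // UTF-8
echo mb_detect_encoding($bytes, array('ISO-8859-1', 'UTF-8'), true); // ISO-8859-1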

Keep in mind mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. Going by that, it has to guess what the encoding is - e.g. the data could be ASCII if every byte is in the 0-127 range, UTF-8 if bytes above 127 only appear as part of valid multi-byte sequences, and so forth.
As you can imagine, given that context, it's quite difficult to detect an encoding reliably.
Like rihk said, this is what the mb_detect_order() function is for - you're basically supplying your best guess at what the data is likely to be. Do you work with UTF-8 files frequently? Then chances are your stuff isn't likely to be UTF-16, even if mb_detect_encoding() could guess it as that.
You might also want to check out Artefacto's link for a more in-depth view.
Example case: when nothing is specified, Internet Explorer uses some interesting encoding guessing (#link, Section: 'To automatically detect a website's language') that has caused strange behaviour on websites that took their encoding for granted in the past. You can probably find some amusing material on that if you google around. It makes for a nice showcase of how even statistical methods can backfire horribly, and why encoding guessing in general is problematic.
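A small sketch of the "stream of bytes" point, with the expected output in comments:
$input = 'ちょっと';                            // what you see as text...
echo strtoupper(bin2hex($input));               // ...is just bytes to PHP: E381A1E38287E381A3E381A8
var_dump(mb_check_encoding($input, 'ASCII'));   // bool(false) -- some bytes are above 127
var_dump(mb_check_encoding($input, 'UTF-8'));   // bool(true)  -- the bytes form valid UTF-8 sequences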

mb_detect_encoding looks at the first charset entry in your mb_detect_order() and then loops through your input $html character by character, checking whether each character falls within the valid set of characters for that charset. If every character matches, it returns that charset; if any character fails, it moves on to the next charset in the mb_detect_order() and tries again.
The wikipedia list of charsets is a good place to see the characters that make up each charset.
Because these charset values overlap (a given byte sequence, such as x8FA1EF, can be valid in more than one charset), such a sequence will be considered a match even though it represents a totally different character in each character set. So unless some of the character values exist in one charset but not in the other, mb_detect_encoding can't identify which of the charsets is invalid, and it will return the first charset from your array list which could be valid.
As far as I'm aware, there is no surefire way of identifying a charset. PHP's "best guess" method can be helped if you have a reasonable idea of what charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid characters) in each charset.
The best solution is to "know" the charset. If you are scraping your html from another page, look for the charset identifier in the header of that page.
If you really want to be clever, you can try and identify the language in which the html is written, perhaps using trigrams or n-grams or similar as described in this article on PHP/ir.
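For the "know the charset" approach when scraping, here is a hedged sketch; the URL is just an example and the regexes are illustrative, not a robust HTML parser:
$url  = 'https://example.com/';   // hypothetical page to scrape
$html = file_get_contents($url);
$charset = null;

// 1. HTTP response header, e.g. "Content-Type: text/html; charset=EUC-JP"
foreach ($http_response_header ?? array() as $header) {
    if (preg_match('/^Content-Type:.*charset=([\w-]+)/i', $header, $m)) {
        $charset = $m[1];
        break;
    }
}

// 2. Otherwise look for a <meta charset=...> or http-equiv declaration in the markup.
if ($charset === null && preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
    $charset = $m[1];
}

echo $charset !== null ? $charset : 'unknown';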

Related

Encoding/charset associated with PHP strings

The PHP documentation says:
Of course, in order to be useful, functions that operate on text may have to make some assumptions about how the string is encoded. Unfortunately, there is much variation on this matter throughout PHP’s functions:
[... a few special cases are described ...]
Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data [...]
Source: https://www.php.net/manual/en/language.types.string.php
So naturally my question is: Where are these specifications that allow us to identify the encoding/charset associated with string arguments, return values, constants, array keys/values, ... for built-in functions/methods/data (e.g. array_key_exists, DOMDocument::getElementsByTagName, DateTime::format, $_GET[$key], ini_set, PDO::__construct, json_decode, Exception::getMessage() and many more)? How do Composer package providers specify the encodings in which they accept/provide textual data?
I have been working roughly with the following heuristic: (1) never change the encoding of anything, (2) when forced to pick an encoding, pick UTF-8. This has been working for years but it feels very unsatisfactory.
Whenever I try to find an answer to the question, I only get search results relating to URL encoding, HTML entities or the interpretation of string literals (with the source file's encoding).
Strings in PHP are what other languages would call byte arrays, i.e. purely a raw sequence of bytes. PHP is not generally interested in what characters those bytes represent, they're just bytes. Only functions that need to work with strings on a character level need to be aware of the encoding, anything else doesn't.
For example, array_key_exists doesn't need to know anything about characters to figure out whether a key with the same bytes as the given string exists in an array.
However, mb_strlen for example explicitly tells you how many characters the string consists of, so it needs to interpret the given string in a specific encoding to give you the right number of characters. mb_strlen('漢字', 'latin1') and mb_strlen('漢字', 'utf-8') give very different results. There isn't a unified way in which these kinds of functions are made encoding-aware*; you will need to consult their manual entries.
* The mb_ functions in particular generally use mb_internal_encoding(), but other sets of functions won't.
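A quick sketch of the byte-versus-character distinction, with the expected results in comments:
$s = '漢字';                            // two characters, six bytes in UTF-8
var_dump(strlen($s));                   // int(6) -- plain byte count, no encoding involved
var_dump(mb_strlen($s, 'UTF-8'));       // int(2) -- bytes interpreted as UTF-8 characters
var_dump(mb_strlen($s, 'ISO-8859-1'));  // int(6) -- every byte treated as one character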
Functions like DateTime::format are looking for specific characters in the format string to replace by date values, e.g. d for the day, m for the month etc. You can generally assume that these are ASCII byte values it's looking for, unless specified otherwise (and I'm not aware of anything that specifies otherwise). So any ASCII compatible encoding will usually do.
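For example (a small sketch; non-ASCII bytes in the format string are simply copied through):
$d = new DateTime('2024-01-31');
echo $d->format('d/m/Y');        // 31/01/2024 -- 'd', 'm', 'Y' are the ASCII bytes it looks for
echo $d->format('d日m月Y年');    // 31日01月2024年 -- bytes outside ASCII pass through untouched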
For a lot more details, you may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
Often this can be found in the official documentation, e.g. the DOMDocument class has an encoding property (determined by the XML declaration). As for methods that return strings, I recommend reading this.

How come two identically encoded words look different in htmlentities?

I have a question concerning UTF-8 and htmlentities. I have two variables with greek text, both of them seem to be UTF-8 encoded (according to mb_detect_encoding()). When I output the two variables, they look exactly the same in the browser (also in the source code).
I was astonished when I realized, that a simple if($var1 == $var2) always failed although they seemed to be exactly the same. So I used htmlentities to see whether the html code would be the same. I was surprised when I saw that the first variable looked like this: Ï�κÏ�λοÏ� and the other one like this: ια&ro;. How can it be that two identical words with the same encoding (UTF-8) are nevertheless different? And how could I fix this problem?
Your first question was: How can it be that two identical words with the same encoding (UTF-8) are nevertheless different?
In this case, the encoding isn't really UTF-8 in both cases. The first variable is in "real" UTF-8, while in the second the Greek characters are not actually in UTF-8 but in ASCII, encoded using something called a CER (Character Entity Reference).
A web browser and some too-friendly "WYSIWYG" editors will render these strings as identical, but the binary representations of the actual strings (which is what the computer will compare) are different. This is why the equality test fails, even if the strings appear to be the same upon human visual inspection in a browser or editor.
I don't think you can rely on mb_detect_encoding to detect the encoding in such cases, since there is no way of telling UTF-8 apart from ASCII that uses CERs to represent the non-ASCII characters.
Your second question was: How could I fix this problem?
Before you can compare strings that may be encoded differently, you need to convert them to canonical form (Wikipedia: Canonicalization) so that their binary representation is identical.
Here is how I've solved it: I've implemented a handy function named utf8_normalize that converts just about any common character representation (in my case: CER, NER, ISO-8859-1 and CP-1252) to canonical UTF-8 before comparing strings. What you throw in there must to some extent be determined by which character representations are "popular" in the environment your software will operate in, but if you just make sure that your strings are in canonical form before comparing, it will work.
As noted in the comment below from the OP (phpheini), there also exists the PHP Normalizer class, which may do a better job of normalization than a home-grown function.
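A hedged sketch of that normalisation idea (not the author's actual utf8_normalize(), which isn't shown; the helper name here is made up): decode entity references, convert legacy single-byte input to UTF-8, then apply Unicode normalization if ext-intl is available.
function toCanonicalUtf8($s) {
    // Decode character/numeric entity references (&alpha;, &#945;, &#x3B1;, ...).
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // If the bytes are not valid UTF-8, assume a Windows-1252/ISO-8859-1 source.
    if (!mb_check_encoding($s, 'UTF-8')) {
        $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }

    // Canonical composition (requires ext-intl), so that "e" + combining acute
    // compares equal to the precomposed "é".
    if (class_exists('Normalizer')) {
        $s = Normalizer::normalize($s, Normalizer::FORM_C);
    }
    return $s;
}

var_dump(toCanonicalUtf8('&alpha;&beta;') === 'αβ'); // bool(true)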

PHP Security: how can encoding be misused?

From this excellent "UTF-8 all the way through" question, I read about this:
Unfortunately, you should verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
Now, I'm still learning the quirks of encoding, and I'd like to know exactly what malicious clients can do to abuse encoding. What can one achieve? Can somebody give an example? Let's say I save the user input into a MySQL database or send it through e-mail; how can a user create harm if I do not use the mb_check_encoding functionality?
how can a user create harm if I do not use the mb_check_encoding functionality?
This is about overlong encodings.
Due to an unfortunate quirk of UTF-8 design, it is possible to make byte sequences that, if parsed with a naïve bit-packing decoder, would result in the same character as a shorter sequence of bytes - including a single ASCII character.
For example the character < is usually represented as byte 0x3C, but could also be represented using the overlong UTF-8 sequence 0xC0 0xBC (or even more redundant 3- or 4-byte sequences).
If you take this input and handle it in a Unicode-oblivious byte-based tool, then any character processing step being used in that tool may be evaded. The canonical example would be submitting 0xC0 0xBC to PHP, which has native byte strings. The typical use of htmlspecialchars to HTML-encode the character < would fail here because the expected byte sequence 0x3C is not present. So the output of the script would still include the overlong-encoded <, and any browser reading that output could potentially read the sequence 0xC0 0xBC 0x73 0x63 0x72 0x69 0x70 0x74 as <script and hey presto! XSS.
Overlongs have been banned since way back and modern browsers no longer permit them. But this was a genuine problem for IE and Opera for a long time, and there's no guarantee every browser is going to get it right in future. And of course this is only one example - any place where a byte-oriented tool processes Unicode strings you've potentially got similar problems. The best approach, therefore, is to remove all overlongs at the earliest input phase.
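A minimal sketch of rejecting that overlong sequence at the input phase:
$overlong = "\xC0\xBCscript";   // a naive decoder reads 0xC0 0xBC as '<'
var_dump(mb_check_encoding($overlong, 'UTF-8'));   // bool(false) -- reject this input
var_dump(mb_check_encoding('<script', 'UTF-8'));   // bool(true)  -- the honest single-byte '<'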
Seems like this is a complicated attack. Checking the docs for mb_check_encoding turns up a note about an "Invalid Encoding Attack". Googling "Invalid Encoding Attack" brings up some interesting results that I will attempt to explain.
When this kind of data is sent to the server, the server performs some decoding to interpret the characters being sent over. It then runs some security checks that look for the encoded versions of certain special characters that could be potentially harmful.
When invalid encoding is sent to the server, the server still runs its decoding algorithm and it will evaluate the invalid encoding. This is where the trouble happens because the security checks may not be looking for invalid variants that would still produce harmful characters when run through the decoding algorithm.
Example of an attack requesting a full directory listing on a Unix system:
http://host/cgi-bin/bad.cgi?foo=..%c0%9v../bin/ls%20-al|
Here are some links if you would like a more detailed technical explanation of what is going on in the algorithms:
http://www.cgisecurity.com/owasp/html/ch11s03.html#id2862815
http://www.cgisecurity.com/fingerprinting-port-80-attacks-a-look-into-web-server-and-web-application-attack-signatures.html

___ encoding to UTF-8 - is there an end-all solution?

I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get an unknown character set, and it has strange characters (like English quotes), is there a standard way to convert them to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode, which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason why you saw so many complicated solutions for this problem is because by definition it is not solvable. The process of encoding a string of text is non-deterministic.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) sent to the server. This is a valid EBCDIC string that represents the letters C and d. It is also a valid ANSI string that represents the two characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because character encoding is non-deterministic, you cannot expect a trivial method to recover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
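A small sketch of the xC3 x84 example, interpreting the same two bytes under different assumed source encodings:
$bytes = "\xC3\x84";                                        // what the browser sends for 'Ä' on a UTF-8 page
echo mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');         // Ä  -- read as UTF-8
echo mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');  // Ã„ -- read as "ANSI" (Windows-1252)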
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
function forceToUtf8($string) {
    // Reject strings that are not even valid in the current internal encoding.
    if (!mb_check_encoding($string)) {
        return false;
    }
    // Otherwise convert from whatever mb_detect_encoding() guesses into UTF-8.
    return mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));
}
If I'm not wrong, there is something called utf8_encode... it works well EXCEPT if you are already in UTF-8 (it assumes the input is ISO-8859-1).
http://php.net/manual/en/function.utf8-encode.php

Do I need to make sure output data is valid UTF-8?

I have a website that claims its output is UTF-8, but I never make sure that it is. Should I use a regular expression or the iconv library to convert UTF-8 to UTF-8 (leaving out invalid sequences)? Is this a security issue if I do not do it?
First of all, I would never just blindly encode it as UTF-8 (possibly a second time), because this would lead to invalid characters, as you say. I would certainly try to detect whether the charset of the content is not UTF-8 before attempting such a thing.
Secondly, if the content in question comes from a source which you have control over and whose charset you control, such as a file stored as UTF-8 or a database with UTF-8 in use in the tables and on the connection, I would trust that source unless something gives me hints that I can't and there is something funky going on. If the content is coming from more or less random places outside your control, well, all the more reason to inspect it and possibly try to re-encode or transform it from other charsets if you can detect them. So the bottom line is: it depends.
As to whether this is a security issue or not, I wouldn't think so (at least I can't think of any scenario where this could be exploited), but I'll leave it to others to be definitive about that.
Not a security issue, but your users (especially non-English speakers) will be very annoyed if you send invalid UTF-8 byte streams.
In the best case (what most browsers do) all invalid strings just disappear or show up as gibberish. The worst case is that the browser quits interpreting your page and says something like "invalid encoding". That is what, e.g., some text editors (namely gedit) on Linux do.
OK, to keep it realistic: if you have an English-centered website that doesn't rely heavily on maths characters or Unicode arrows, it will make almost no difference. But if you serve, e.g., a Chinese site, you can totally screw it up.
Everybody gets charsets messed up, so generally you can't trust any outside source. It's good practice to verify that the provided input is indeed valid for the charset it claims to use. Luckily, with UTF-8, you can make a fairly safe assertion about validity.
If it's possible for users to send in arbitrary bytes, then yes, there are security implications of not ensuring valid UTF-8 output. Depending on how you're storing data, though, there are also security implications of not ensuring valid UTF-8 data on input (e.g. it's possible to create a variant of this SQL injection attack that works with UTF-8 input if the UTF-8 is allowed to be invalid), so you really should be using iconv to convert UTF-8 to UTF-8 on input, and just avoid the whole issue of validating UTF-8 on output.
The two main security reasons you want to check that the output is valid UTF-8 are to avoid "overlong" byte sequences - that is, cases of byte sequences that mean some character like '<' but are encoded in multiple bytes - and to avoid invalid byte sequences. The overlong encoding issue is obvious - if your filter changes '<' into '&lt;', it might not convert a sequence that means '<' but is written differently. Note that all current-generation browsers will mark overlong sequences as invalid, but some people may be using old browsers.
The issue with invalid sequences is that some utf-8 parsers will allow an invalid sequence to eat some number of valid bytes that follow the invalid ones. Again, not an issue if everyone always has a current browser, but...
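A hedged sketch of that input-side clean-up; the answer suggests iconv, and mb_convert_encoding is shown as an alternative (it replaces invalid sequences with the mbstring substitute character, '?' by default):
$dirty = "abc\xC0\xBC\xFFdef";   // contains an overlong '<' and a stray 0xFF

var_dump(mb_check_encoding($dirty, 'UTF-8'));   // bool(false)

// iconv route: //IGNORE drops invalid sequences (behaviour can vary between iconv builds).
$cleanIconv = iconv('UTF-8', 'UTF-8//IGNORE', $dirty);

// mbstring route: invalid sequences are replaced rather than dropped.
$cleanMb = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

var_dump(mb_check_encoding($cleanMb, 'UTF-8')); // bool(true)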
