mb_detect_encoding() discrepancy for non latin1 characters - php

I'm using the mb_detect_encoding() function to check whether a string contains non-Latin-1 (ISO-8859-1) characters.
Since Japanese isn't part of Latin-1, I'm using it as the text within the test string, yet when the string is passed to the function it happily reports ISO-8859-1. Example code:
$str = "これは日本語のテキストです。読めますか";
$res = mb_detect_encoding($str,"ISO-8859-1",true);
print $res;
I've tried using 'ASCII' instead of 'ISO-8859-1', which correctly returns false. Is anyone able to explain the discrepancy?

I wanted to be funny and say hexdump could explain it:
0000000 81e3 e393 8c82 81e3 e6af a597 9ce6 e8ac
0000010 9eaa 81e3 e3ae 8683 82e3 e3ad b982 83e3
0000020 e388 a781 81e3 e399 8280 aae8 e3ad 8182
0000030 81e3 e3be 9981 81e3 0a8b
But alas, that's quite the opposite.
In ISO-8859-1, practically only the code points \x80-\x9F are unassigned to printable characters (they hold the C1 control codes). But these are exactly the byte values that the UTF-8 representation of your Japanese characters occupies.
Anyway, mb_detect_encoding uses heuristics, and it fails in this example. My conjecture is that it mistakes ISO-8859-1 for -15 or worse: CP1252, the incompatible Windows charset, which does assign printable characters to said code points.
I would suggest you use a workaround and test it yourself. The only check that assures a byte in a string is certainly not a Latin-1 character is:
preg_match('/[\x7F-\x9F]/', $str);
I'm linking to the German Wikipedia, because their article shows the differences best: http://de.wikipedia.org/wiki/ISO_8859-1
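A quick sketch of that workaround, using the Japanese string from the question (stored as UTF-8):

```php
// The UTF-8 encoding of the Japanese text contains continuation bytes
// in the \x80-\x9F range, which ISO-8859-1 reserves for C1 control codes.
$str = "これは日本語のテキストです。読めますか";
$isNotLatin1 = (bool) preg_match('/[\x7F-\x9F]/', $str);
var_dump($isNotLatin1); // bool(true): the string cannot be meaningful Latin-1 text
```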

Related

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered by var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes, so the character encoding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
    return '\x' . rtrim(chunk_split(strtoupper(bin2hex($str)), 2, '\x'), '\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF-8 characters untouched?
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF-8 characters untouched?
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach (str_split($s) as $c) {
    $buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
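To address the original question (escape only the invalid bytes, keep valid UTF-8 intact) more directly, here is a sketch; the function name escapeInvalidUtf8 and the character-at-a-time matching approach are my own additions, not part of the answer above:

```php
// Walk the string, copying each valid UTF-8 character as-is and
// hex-escaping every byte that cannot start a valid sequence.
function escapeInvalidUtf8(string $s): string {
    // One alternative per valid UTF-8 sequence length, anchored at the start.
    $utf8Char = '/^(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})/';
    $out = '';
    for ($i = 0, $len = strlen($s); $i < $len; ) {
        if (preg_match($utf8Char, substr($s, $i, 4), $m)) {
            $out .= $m[0];                              // valid character: copy
            $i += strlen($m[0]);
        } else {
            $out .= '\x' . strtoupper(bin2hex($s[$i])); // invalid byte: escape
            $i++;
        }
    }
    return $out;
}

echo escapeInvalidUtf8("The price is 15 \x80"); // The price is 15 \x80
```

A UTF-8 € (\xE2\x82\xAC) passes through untouched, while the lone \x80 byte is rendered as the literal four characters \x80.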

How to display the (extended) ASCII representation of a special character in PHP 5.6?

I am trying to decode this special character: "ß", if I use "ord()", I get "C3"
echo "ord hex--> " . dechex(ord('ß'));
...but that doesn't look good; so I tried "bin2hex()", and now I get "C39F" (what?).
echo "bin2hex --> " . bin2hex('ß');
By using an Extended ASCII table from the Internet, I know that the correct hexadecimal value is "DF", so I then tried "hex2bin()", but that gives me some unknown character like this: "�".
echo "hex2bin --> " . hex2bin('DF');
Is it possible to get the "DF" output?
You're on the right path with bin2hex, what you're confused about is merely the encoding. Currently you're seeing the hex value of ß for the UTF-8 encoding, because your string is encoded in UTF-8. What you want is the hex value for that string in some other encoding. Let's assume "Extended ASCII" refers to ISO-8859-1, as it colloquially often does (but doesn't have to):
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'ß'));
Now, having said that, I have no idea what you'd use that information for. There are many valid "hex values" for the character ß in various different encodings; "Extended ASCII" is just one possible answer, and it's a vague answer to be sure, since "Extended ASCII" has very little practical meaning with hundreds of different "Extended ASCII" charsets available.
ASCII goes from 0x00 to 0x7F. This is not enough to represent all the characters needed so historically old Windows OSes used the available space in a byte (from 0x80 to 0xFF) to represent different characters depending on the localization. This is what codepages are: an arbitrary mapping of non-ASCII values to non-ASCII characters. What you call "extended ASCII" is IMO an inappropriate name for a codepage.
The assumption that 1 byte = 1 character is dead and, where it isn't, it must die.
So what you are actually seeing is the UTF-8 representation of ß. If you want to see the Unicode code point value of ß (or any other character), just show its UTF-32 representation, which AFAIK is mapped 1:1.
// Prints 000000df
echo bin2hex(iconv('UTF-8', 'UTF-32BE', 'ß'));
bin2hex() should be fine, as long as you know what encoding you are using.
The C3 output you get appears to be the first byte of the two-byte representation of the character in UTF-8 (which, incidentally, means you've configured your editor to save files in that encoding, which is a good idea in 2017).
The ord() function does not accept arbitrary encodings, let alone Unicode-compatible ones such as UTF-8:
Returns the ASCII value of the first character of string.
ASCII (a fairly small 7-bit charset) does not have any encoding for the ß character (aka U+00DF LATIN SMALL LETTER SHARP S). Seriously. ASCII does not even have a DF position (it only goes up to 7F).
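As a side note for readers on newer PHP (7.2+, so not the 5.6 from the question): mb_ord() returns the Unicode code point directly, which for ß coincides with the Latin-1 value DF:

```php
// mb_ord() (PHP 7.2+) decodes a UTF-8 character to its Unicode code point.
echo dechex(mb_ord('ß', 'UTF-8')); // df (U+00DF)
```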

How to detect which type of chinese encoding has text file?

on http://www.gnu.org/software/libiconv/ there are like 20 types of encoding for Chinese:
Chinese EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950,
BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999,
ISO-2022-CN, ISO-2022-CN-EXT
So I have a text file that is not UTF-8. It's ASCII. And I want to convert it to UTF-8 using iconv(). But for that I need to know the character encoding of the source.
How can I do that if I don't know Chinese? :(
I noticed that:
$str = iconv('GB18030', 'UTF-8', $str);
file_put_contents('file.txt', $str);
produces a UTF-8 encoded file, while the other encodings I tried (CP950, GBK and EUC-CN) produced an ASCII file. Could that mean that iconv is able to detect if the input encoding is wrong for the given string?
This may work for your needs (but I really can't tell). Setting the locale, applying utf8_decode, and using mb_check_encoding instead of mb_detect_encoding seems to give some useful output.
// some text from http://chinesenotes.com/chinese_text_l10n.php
// have tried both as string and content loaded from a file
$chinese = '譧躆 礛簼繰 剆坲姏 潧 騔鯬 跠 瘱瘵瘲 忁曨曣 蛃袚觙';
$chinese=utf8_decode($chinese);
$chinese_encodings ='EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';
$encodings = explode(',',$chinese_encodings);
//set chinese locale
setlocale(LC_CTYPE, 'Chinese');
foreach ($encodings as $encoding) {
    if (@mb_check_encoding($chinese, $encoding)) {
        echo 'The string seems to be compatible with ' . $encoding . '<br>';
    } else {
        echo 'Not compatible with ' . $encoding . '<br>';
    }
}
outputs
The string seems to be compatible with EUC-CN
The string seems to be compatible with HZ
The string seems to be compatible with GBK
The string seems to be compatible with CP936
Not compatible with GB18030
The string seems to be compatible with EUC-TW
The string seems to be compatible with BIG5
The string seems to be compatible with CP950
Not compatible with BIG5-HKSCS
Not compatible with BIG5-HKSCS:2004
Not compatible with BIG5-HKSCS:2001
Not compatible with BIG5-HKSCS:1999
Not compatible with ISO-2022-CN
Not compatible with ISO-2022-CN-EXT
It is a total guess. Now it at least seems to recognise some of the Chinese encodings. Delete if it is total junk.
I have zero experience with Chinese encodings and I know this question is tagged iconv, but if it will get the job done, then you may try mb_detect_encoding to detect your encoding; the second argument is the list of encodings to check, and there is a user-crafted comment about Chinese encodings:
For Chinese developers: please note that the second argument of this
function DOES NOT include 'GB2312' and 'GBK' and the return value is
'EUC-CN' when it is detected as a GB2312 string.
so maybe it will work if you explicitly provide full list of chinese encodings as a second argument? It could work like this:
$encoding = mb_detect_encoding($chineseString, 'GB2312,GBK,(...)');
if ($encoding) $utf8text = iconv($encoding, 'UTF-8', $chineseString);
you may also want to play with third argument (strict)
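A hedged sketch of that approach; the sample bytes \xC4\xE3\xBA\xC3 are "你好" encoded as GB2312/EUC-CN, and the candidate list is illustrative, not exhaustive:

```php
// These bytes are not valid UTF-8, so strict detection falls through to EUC-CN.
$bytes = "\xC4\xE3\xBA\xC3";
$encoding = mb_detect_encoding($bytes, ['UTF-8', 'EUC-CN'], true); // strict = true
if ($encoding !== false) {
    echo iconv($encoding, 'UTF-8', $bytes); // 你好
}
```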
What makes it hard to detect the encoding is the fact that octet sequences decode to valid characters in several encodings, but the result makes sense in only the correct encoding. What I've done in these cases is take the decoded text and go to an automatic translation service and see if you get back legible text or a jumble of syllables.
You can do this programmatically, for example, by analyzing trigram frequencies in the input text. Libraries like this one have already been created to solve this problem, and there are external programs that do it, but I have yet to see anything with a PHP API. This approach is not fool-proof though.
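As to the earlier observation that iconv "is able to detect if the input encoding is wrong": it does fail outright (a notice plus a false return) when the input bytes are illegal in the declared source encoding. A minimal sketch with deliberately invalid bytes:

```php
// \xFF cannot start a GB18030 sequence, so the conversion fails.
$invalid = "\xFF\xFE";
$result = @iconv('GB18030', 'UTF-8', $invalid); // @ suppresses the notice
var_dump($result === false); // bool(true)
```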

Different utf8 encodings?

I've a small issue with UTF-8 encoding.
The word I try to encode is "kühl".
So it has a special character in it.
When I encode this string with UTF-8 in the first file I get:
kühl
When I encode this string with UTF-8 in the second file I get:
ku�hl
With PHP utf8_encode() I always get the first one (kühl) as output, but I would need the second one (ku�hl).
mb_detect_encoding tells me both are "UTF-8", so this does not really help.
Do you have any ideas how to get the second one as output?
Thanks in advance!
There is only one encoding called UTF-8 but there are multiple ways to represent some glyphs in Unicode. U+00FC is the Latin-1 compatibility single-glyph precomposed ü which displays as kühl in Latin-1 whereas off the top of my head kuÌ�hl looks like a fully decomposed expression of the same character, i.e. U+0075 (u) followed by U+0308 (combining diaeresis). See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
ku�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a ku..hl.
0x88 is not a printable character in Latin-1, so (in my browser) it displays as an "invalid character" placeholder (a black diamond with a white question mark in it), whereas others might see something else, or nothing at all.
Apparently you can use the Normalizer class (from the intl extension) to convert between these two forms in PHP:
$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
By the by, viewing UTF8 as Latin-1 and copy/pasting the representation as if it was actual real text is capricious at best. If you have character encoding questions, the actual bytes (for example, in hex) is the only portable, understandable way to express what you have. How your computer renders it is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to articulate the problem unambiguously.
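A small sketch of that normalization round trip (requires the intl extension; the \u{...} escape needs PHP 7+):

```php
// NFC composes "u" + U+0308 (combining diaeresis) into the precomposed U+00FC;
// NFD decomposes it again.
$decomposed = "ku\u{0308}hl";
$composed   = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($composed === "kühl");                                              // bool(true)
var_dump(Normalizer::normalize("kühl", Normalizer::FORM_D) === $decomposed); // bool(true)
```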
utf8_encode, despite its name, does not magically encode into UTF-8.
It will only work if your source is ISO-8859-1, also known as Latin-1.
If your source was already UTF-8 or any other encoding, it will output broken data.

Is testing for UTF-8 strings in PHP a reliable method?

I've found a useful function on another answer and I wonder if someone could explain to me what it is doing and if it is reliable. I was using mb_detect_encoding(), but it was incorrect when reading from an ISO 8859-1 file on a Linux OS.
This function seems to work in all cases I tested.
Here is the question: Get file encoding
Here is the function:
function isUTF8($string){
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # Non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # Excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # Straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # Excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # Planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # Planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # Plane 16
)+%xs', $string);
}
Is this a reliable way of detecting UTF-8 strings?
What exactly is it doing? Can it be made more robust?
If you do not know the encoding of a string, it is impossible to guess the encoding with any degree of accuracy. That's why mb_detect_encoding simply does not work. If however you know what encoding a string should be in, you can check if it is a valid string in that encoding using mb_check_encoding. It more or less does what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. That doesn't necessarily mean the string actually is encoded in that encoding, just that it may be. For example, it'll be impossible to distinguish any single-byte encoding using all 8 bits from any other single-byte encoding using 8 bits. But UTF-8 should be rather distinguishable, though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
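A minimal illustration of that validity check:

```php
// "kühl" as UTF-8: \xC3\xBC is a valid two-byte sequence.
var_dump(mb_check_encoding("k\xC3\xBChl", 'UTF-8')); // bool(true)
// "kühl" as Latin-1: a bare \xFC byte can never occur in valid UTF-8.
var_dump(mb_check_encoding("k\xFChl", 'UTF-8'));     // bool(false)
```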
Well, it only checks if the string has byte sequences that happen to correspond to valid UTF-8 code points. However, it won't flag the sequence 0x00-0x7F which is the ASCII compatible subset of UTF-8.
EDIT: Incidentally, I am guessing the reason you thought mb_detect_encoding() "didn't work properly" was because your Latin-1 encoded file only used the ASCII compatible subset, which is also valid in UTF-8. It's no wonder that mb_detect_encoding() would flag that as UTF-8, and it is "correct": if the data is just plain ASCII, then the answer UTF-8 is as good as Latin-1, or ASCII, or any of the myriad extended ASCII encodings.
That will just detect if part of the string is a formally valid UTF-8 sequence, ignoring one code unit encoded characters (representing code points in ASCII). For that function to return true it suffices that there's one character that looks like a non-ASCII UTF-8 encoded character.
Basically, no.
Any UTF8 string is a valid 8-bit encoding string (even if it produces gibberish).
On the other hand, most 8-bit encoded strings with extended (128+) characters are not valid UTF8, but, as any other random byte sequence, they might happen to be.
And, of course, any ASCII text is valid UTF-8, so mb_detect_encoding is, in fact, correct by saying so. And no, you won't have any problems using ASCII text as UTF-8. It's the reason UTF-8 works in the first place.
As far as I understand, the function you supplied does not check for validity of the string, just that it contains some sequences that happen to be similar to those of UTF8, thus this function might misfire much worse. You may want to use both this function and mb_detect_encoding in strict mode and hope that they cancel out each others false positives.
If the text is written in a non-latin alphabet, a "smart" way to detect a multibyte encoding is to look for sequences of equally sized chunks of bytes starting with the same bits. For example, Russian word "привет" looks like this:
11010000 10111111
11010001 10000000
11010000 10111000
11010000 10110010
11010000 10110101
11010001 10000010
This, however, won't work for latin-based alphabets (and, probably, Chinese).
The function in question (the one that the user pilif posted in the linked question) appears to have been taken from this comment on the mb_detect_encoding() page in the PHP Manual:
As the author states, the function is only meant to "check if a string contains UTF-8 characters" and it only looks for "non-ascii multibyte sequences in the UTF-8 range". Therefore, the function returns false (zero actually) if your string just contains simple ASCII characters (like English text), which is probably not what you want.
His function was based on another function in this previous comment on that same page which is, in fact, meant to check if a string is UTF-8 and was based on this regular expression created by someone at W3C.
Here is the original, correctly working (I've tested) function that will tell you if a string is UTF-8:
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
// From http://w3.org/International/questions/qa-forms-utf-8.html
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
} // function is_utf8
This may not be the answer to your question (maybe it is, see the update below), but it could be the answer to your problem. Check out my Encoding class that has methods to convert strings to UTF8, no matter if they are encoded in Latin1, Win1252, or UTF8 already, or a mix of them:
Encoding::toUTF8($text_or_array);
Encoding::toWin1252($text_or_array);
Encoding::toISO8859($text_or_array);
// fixes UTF8 strings converted to UTF8 repeatedly:
// "FÃÂédÃÂération" to "Fédération"
Encoding::fixUTF8($text_or_array);
https://stackoverflow.com/a/3479832/290221
The function runs byte by byte and figures out whether each one of them needs conversion or not.
Update:
Thinking a little bit more about it, this could in fact be the answer to your question:
require_once('Encoding.php');
function validUTF8($string){
return Encoding::toUTF8($string) == $string;
}
And here is the Encoding class:
https://github.com/neitanod/forceutf8
