PHP's mb_detect_encoding() doesn't understand the MacRoman encoding. My app allows users to upload data in csv format and I need to convert it to utf8 because the users are not tech-savvy. I will never be able to get all of them to understand how to do it and control their encoding.
This is what I'm doing:
$encoding_detection_order = array('UTF-8', 'UTF-7', 'ASCII', 'ISO-8859-1', 'EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', );
$encoding = mb_detect_encoding($value, $detection_order, true);
$converted_value = iconv($encoding, 'UTF-8//TRANSLIT', $value);
This works great for most situations, but if my user is on a Mac and they save the CSV in MacRoman encoding, then the above code will usually wrongly detect the text as ISO-8859-1 which causes the iconv() to produce bad output.
For example, the accented-e in Jaimé has a hex value of 0x8e in MacRoman. In ISO-8859-1, the 0x8e character is Ž and so when I covert it to utf8, I just get the utf8 version of Ž when I should be getting é.
I need to be able to dynamically differentiate MacRoman from other encodings so that I convert it properly.
Related
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
I have a file, which contains some cyrillic characters. When I open this file in Notepad++ I see, that it has ANSI encoding. If I manually encode it into UTF-8 using Notepad++, then everything is absolutely ok - I can use this file in my parsers and get results. But what I want is to do it programmatically, using PHP. This is what I tried after searching through SO and documentation:
file_put_contents($file, utf8_encode(file_get_contents($file)));
In this case when my algorithm parses the resulting files, it meets such letters as "è", "í" , "â". In other words, in this case I get some rubbish. I also tried this:
file_put_contents($file, iconv('WINDOWS-1252', 'UTF-8', file_get_contents($file)));
But it produces the very same rubbish. So, I really wonder how can I achive programmatically what Notepad++ does. Thanks!
Notepad++ may report your encoding as ANSI but this does not necessarily equate to Windows-1252. 1252 is an encoding for the Latin alphabet, whereas 1251 is designed to encode Cyrillic script. So use
file_put_contents($file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($file)));
to convert from 1251 to utf-8 with iconv.
Using PHP CLI, this works well:
$result = iconv (LATIN1, 'UTF-8', N�n��;M�tt);
Result is: Nönüß
This also works for CP437, Windows, Macintosh etc.
On apache, the SAME code results in:
$result = iconv (LATIN1, 'UTF-8', N�n��;M�tt);
Result is: Nönüß
I googled around and added setlocale(LC_ALL, "en_US.utf8"); to the script, but made no difference. Thanks for helping!
I run Debian Linux with apache2 and php 5.4. I am trying to convert different CSV files as they are being uploaded into UTF-8 for processing.
UPDATE: I found my own solution.
$result = utf8_decode (iconv (LATIN1, 'UTF-8', N�n��;M�tt));
utf8_decode makes it show up correctly in the browser and when saved to the MySQL DB.
There are always two sides to encoding: the encoded string, and the entity interpreting this encoded string into readable characters! This "entity", as I'll ambiguously call it, can be the database, the browser, your text editor, the console, or whatever else.
$result = iconv('LATIN1', 'UTF-8', 'N�n��;M�tt');
Result is: Nönüß
Not sure where you're getting 'N�n��;M�tt' from exactly, but the UNICODE REPLACEMENT CHARACTERS � in there indicate that you're trying to interpret this string as UTF-8, but the string is not actually UTF-8 encoded. Using iconv to convert it from Latin-1 to UTF-8 makes the correct characters appear - that means the string was originally Latin-1 encoded and converting it to your expected encoding solved the discrepancy.
On apache, the SAME code results in Nönüß
That means the interpreting party here is not interpreting the string as UTF-8 this time, even though the string is UTF-8. I assume by "Apache" you mean "in the browser". You need to tell your browser through HTTP headers or HTML meta tags that it's supposed to interpret the text as UTF-8.
I found my own solution.
$result = utf8_decode (iconv (LATIN1, 'UTF-8', N�n��;M�tt));
Guess what utf8_decode does. It converts the encoding of a string from UTF-8 to Latin-1. So the above code converts Latin-1 to ... Latin-1.
Please read the following:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
I have a weird problem , the following code :
$str = "נסיון" // <--- Hebrew chars
echo mb_detect_encoding ($str)."<br><br><br>";
$str = iconv (mb_detect_encoding($str),'UCS-2BE',$str);
echo mb_detect_encoding ($str)."<br><br><br>";
This will output :
UTF-8
UTF-8
This code is written in a file that's encoded (using Notepad++) in UTF-8 Without BOM, trying other encodings and didn't work.
I also tried converting the string using :
$str = mb_convert_encoding($str,'UCS-2BE');
But that didn't work either. Any insights?
From the documentation for mb_detect_order, the function that establishes the order in which mb_detect_encoding tests different encodings:
mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP
For ISO-8859-*, mbstring always detects as ISO-8859-*.
For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
So, you can't detect the encoding of the second string with the mb functions.
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));