I am trying to understand the difference between latin1 and UTF-8, and for the most part I get it. However, when testing I am getting some weird results and could use some help clarifying them.
I am testing with 'é' (Latin small letter e with acute), which has the UTF-8 hex encoding C3A9.
I set up a database with utf8 as the character set, then created a table with utf8 as the character set and, after setting the connection and client character sets to utf8, inserted a record containing the character 'é'.
When I do SELECT hex(field), field FROM test_table, I get:
hex(field), field
C3A9, é
This is fine and consistent with what I have read. However, when I do the exact same thing using the latin1 character set, I get the following:
hex(field), field
C3A9, é
But if I insert char(x'E9'), which should be the single-byte latin1 equivalent of é, I can get it to display correctly using SET NAMES utf8, yet it does not show up correctly when I set the connection and client to latin1.
Can anyone clarify? Shouldn't latin1 characters be single-byte (hex E9) in both UTF-8 and latin1, or am I completely misunderstanding it all?
Thanks
The latin1 encoding has only 1-byte codes.
The first 128 codes (7-bit) are identical between latin1 and utf8.
é is beyond the first 128; its 1-byte, 8-bit latin1 code is hex E9 (as you observed). In utf8 it takes 2 bytes: C3A9. Most Asian characters take 3 bytes in utf8; latin1 cannot represent those characters at all.
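To make the byte difference concrete, here is a minimal SQL sketch (the table names are just examples, not from the question). With the client genuinely sending UTF-8 and SET NAMES utf8 in effect, MySQL converts on the way in, so the latin1 column stores one byte and the utf8 column stores two:
SET NAMES utf8;
CREATE TABLE t_latin1 (field VARCHAR(10)) CHARACTER SET latin1;
CREATE TABLE t_utf8   (field VARCHAR(10)) CHARACTER SET utf8;
INSERT INTO t_latin1 VALUES ('é');
INSERT INTO t_utf8   VALUES ('é');
SELECT HEX(field) FROM t_latin1;  -- E9   (1 byte, latin1)
SELECT HEX(field) FROM t_utf8;    -- C3A9 (2 bytes, utf8)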
MySQL has the confusing command SET NAMES utf8. That announces that the client's encoding is utf8, and instructs the communication between client and server to convert between the column's CHARACTER SET and utf8 when reading/writing.
If you have SET NAMES latin1 (the old default), but the bytes in the client are encoded utf8, then you are 'lying', and various nasty things happen. But there is no immediate clue that something is wrong.
Checklist for going entirely utf8 (a sketch applying it follows the list):
Bytes in client are utf8-encoded
SET NAMES utf8 (or equivalent parameter during connecting to MySQL)
CHARACTER SET utf8 on column or table declaration
<meta ... UTF-8> in html
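A minimal PHP sketch tying the checklist together (host, credentials and table name are placeholders): the script itself is saved as UTF-8, the connection charset is utf8, the column is declared utf8, and the page is served as UTF-8.
<?php
// 1. this file is saved as UTF-8, so string literals like 'é' are utf8-encoded bytes
$conn = new mysqli('localhost', 'user', 'password', 'test');
$conn->set_charset('utf8');                        // 2. same effect as SET NAMES utf8
$conn->query("CREATE TABLE IF NOT EXISTS test_table
              (field VARCHAR(50)) CHARACTER SET utf8");   // 3. column/table charset
$conn->query("INSERT INTO test_table (field) VALUES ('é')");
header('Content-Type: text/html; charset=utf-8');  // 4. matches <meta charset=UTF-8> in the html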
After recently putting a website through the wringer UTF-8-wise, I think this is a case of viewing UTF-8 data stored in a latin1 table within a UTF-8 encoded page or terminal.
If you are using a terminal, you can check this by looking at its character encoding setting (in Ubuntu it's Terminal -> Set Character Encoding). If you are using something like phpMyAdmin, view the page source and look for the charset of the page, or open up Firebug and look at the response headers for the page; it should say "UTF-8".
If you've inserted the data and it's encoded in UTF-8 and it goes into a latin1 table then the data will still be stored in UTF-8, it's only when you start viewing that data or retrieving that data in a different encoding that you start getting the mangled effect.
I've found it's really crucial when working with character encodings to keep everything the same: the page must have a charset of UTF-8, the data going upstream into the database must be UTF-8, and the database must have a default charset and storage of UTF-8. As soon as you put a different charset in the mix, everything goes crazy.
I did this for older PHP versions:
<?php require_once('Connections/SQLConn.php');
mysql_query("SET NAMES 'utf8'"); ?>
Now I want to replace this obsolete code in order to upgrade to PHP 5.6. I tried htmlentities(), htmlspecialchars(), and switching to mysqli, and it doesn't work.
(example of a typical black rhombus)
My local database uses the latin1_swedish_ci collation because my website supports both Spanish and English, and every single table in my db uses the utf8_spanish_ci collation (in case you need to know that).
Do not use the mysql_* API. Switch to PDO or mysqli. With the latter, the statement is $conn->set_charset('utf8');
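For example (connection details are placeholders), the two APIs set the connection charset like this:
<?php
// mysqli: set the connection character set right after connecting
$conn = new mysqli('localhost', 'user', 'password', 'mydb');
$conn->set_charset('utf8');
// PDO: pass the charset in the DSN
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'password');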
The black rhombus is usually caused by
The encoding of the text in the client is latin1
The connection (as being discussed) is latin1
The table column may be either latin1 or utf8 (either has the same effect)
The HTML code says <meta ... charset=UTF-8>
You need to either go latin1 all the way, in which case the charset in the meta tag needs to be ISO-8859-1, or go utf8 all the way.
Spanish works 'equally' well in latin1 as in utf8. But if you go east of western Europe, latin1 will be inadequate.
Since you have been messing around, please check what is actually in the table. ñ should be hex F1 if the column is CHARACTER SET latin1, or C3B1 if it is utf8. If you see C3B1 in a latin1 column, you have another problem.
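A quick way to check (assuming a table named test_table with a column field; adjust to your schema):
SELECT field, HEX(field) FROM test_table;
-- F1   means ñ is correctly stored as latin1
-- C3B1 means the utf8 bytes for ñ were stored; in a latin1 column, that is the other problem mentioned above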
This is my environment: Client -> iOS app, Server -> PHP and MySQL.
The data from client to server is done via HTTP POST.
The data from server to client is sent as JSON.
I would like to add support for emojis or any utf8mb4 character in general. I'm looking for the right way for dealing with this under my scenario.
My questions are the following:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Should I try to work in the DB with utf8mb4, or is it safer/better/more supported to work in utf8 and encode symbols? If so, which encoding method should I use so that it works flawlessly in Objective-C and PHP (and Java for the future Android version)?
Right now I have the DB with utf8mb4, but I get errors when trying to store a raw emoji. On the other hand, I can store non-ASCII symbols such as ¿ or á.
When I retrieve these symbols in PHP, I first need to execute SET CHARACTER SET utf8 (if I get them in utf8mb4, the json_decode function doesn't work); such symbols then come out encoded (e.g., ¿ is encoded as \u00bf).
MySQL's utf8 charset is not actually UTF-8; it's a subset of UTF-8 that only supports the basic plane (characters up to U+FFFF). Most emoji use code points higher than U+FFFF. MySQL's utf8mb4 is actual UTF-8, which can encode all those code points. Outside of MySQL there's no such thing as "utf8mb4"; there's just UTF-8. So:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
Again, no such thing as "utf8mb4". HTTP POST requests support any raw bytes, if your client sends UTF-8 encoded data you're fine.
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Yes.
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?
God no, use raw UTF-8 (utf8mb4) for all that is holy.
When I retrieve this symbols in PHP I first need to execute SET CHARACTER SET utf8
Well, there's your problem; channeling your data through MySQL's utf8 charset will discard any characters above U+FFFF. Use utf8mb4 all the way through MySQL.
if I get them in utf8mb4 the json_decode function doesn't work
You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:
echo json_encode('😀');
"\ud83d\ude00"
echo json_decode('"\ud83d\ude00"');
😀
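A minimal end-to-end sketch (DSN, credentials and table name are placeholders) keeping utf8mb4 on the connection, so an emoji survives the round trip into JSON:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'password');
$pdo->exec("CREATE TABLE IF NOT EXISTS messages
            (body VARCHAR(255)) CHARACTER SET utf8mb4");
$pdo->prepare("INSERT INTO messages (body) VALUES (?)")->execute(['😀']);
$row = $pdo->query("SELECT body FROM messages LIMIT 1")->fetch(PDO::FETCH_ASSOC);
echo json_encode($row);   // {"body":"\ud83d\ude00"}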
Use utf8mb4 throughout MySQL:
SET NAMES utf8mb4
Declare the table/columns CHARACTER SET utf8mb4
Emoji and certain Chinese characters will work in utf8mb4, but not in MySQL's utf8.
Use UTF-8 throughout other things:
HTML: <meta charset=UTF-8>
¿ or á are (or at least can be) encoded in utf8 (utf8mb4)
I have read that MySQL >= 5.5.3 fully supports every possible character if you use the utf8mb4 encoding for a given table/column: http://mathiasbynens.be/notes/mysql-utf8mb4
That looks nice. Only I noticed that PHP's mb_* functions do not: I cannot find utf8mb4 anywhere in the list: http://php.net/manual/en/mbstring.supported-encodings.php
Not only have I read about it, but I also ran a test.
I added data to a MySQL utf8mb4 table using a PHP script whose internal encoding was set to UTF-8: mb_internal_encoding("UTF-8");
and, as expected, the characters look messy once in the db.
Any idea how I can make PHP and MySQL talk the same encoding (possibly a 4-byte one) and still have FULL support for any world language?
Also, why is utf8mb4 different from UTF-32?
MySQL's utf8 encoding is not actual UTF-8. It's an encoding that is kinda like UTF-8, but only supports a subset of what UTF-8 supports. utf8mb4 is actual UTF-8. This difference is an internal implementation detail of MySQL. Both look like UTF-8 on the PHP side. Whether you use utf8 or utf8mb4, PHP will get valid UTF-8 in both cases.
What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4. If it's set to utf8, MySQL will not support all characters. You set this connection encoding using mysql_set_charset(), the PDO charset DSN connection parameter or whatever other method is appropriate for your database API of choice.
mb_internal_encoding just sets the default value for the $encoding parameter all mb_* functions have. It has nothing to do with MySQL.
UTF-8 and UTF-32 differ in how they encode characters. UTF-8 uses a minimum of 1 byte for a character and a maximum of 4. UTF-32 always uses 4 bytes for every character. UTF-16 uses a minimum of 2 bytes and a maximum of 4.
Due to its variable length, UTF-8 has a little bit of overhead. A character which can be encoded in 2 bytes in UTF-16 may take 3 in UTF-8; on the other hand, UTF-16 never uses fewer than 2 bytes. If you're storing lots of Asian text, UTF-16 may use less storage. If most of your text is English/ASCII, UTF-8 uses less storage. UTF-32 always uses the most storage.
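A small sketch of those sizes from PHP (this assumes the source file is saved as UTF-8; strlen counts bytes, and mb_convert_encoding re-encodes so we can count bytes in the other forms):
<?php
foreach (['A', 'é', '€', '😀'] as $ch) {
    printf("%s  UTF-8: %d  UTF-16: %d  UTF-32: %d bytes\n",
        $ch,
        strlen($ch),                                           // bytes in UTF-8
        strlen(mb_convert_encoding($ch, 'UTF-16BE', 'UTF-8')),
        strlen(mb_convert_encoding($ch, 'UTF-32BE', 'UTF-8')));
}
// A: 1/2/4   é: 2/2/4   €: 3/2/4   😀: 4/4/4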
This is what I used, and it worked well for my problem with the euro € sign and a json_encode conversion failure.
PHP configuration script (API etc.):
header('Content-Type: text/html; charset=utf-8');
ini_set("default_charset", "UTF-8");
mb_internal_encoding("UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
MySQL tables / specific columns:
utf8mb4
MySQL PDO connection:
$dsn = 'mysql:host=yourip;dbname=XYZ;charset=utf8mb4';
(...your connection ...)
Before executing the query (might not be required):
$dbh->exec("set names utf8mb4");
UTF-32: a character encoding using a fixed 4 bytes per character.
UTF-8: a character encoding using up to 4 bytes per character, but the most frequent characters are coded in only 1, 2 or 3 bytes.
MySQL's utf8 doesn't support characters coded in more than 3 bytes, so they added utf8mb4, which is real UTF-8.
Before running your actual query, do mysql_query('SET NAMES utf8mb4').
Also make sure your MySQL server is configured to use utf8mb4 too. For more information on how, refer to this article: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
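The server-side part of that configuration usually goes in my.cnf; these are the commonly recommended settings (adjust to your setup):
[client]
default-character-set = utf8mb4

[mysqld]
character-set-server  = utf8mb4
collation-server      = utf8mb4_unicode_ci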
I have a bunch of Danish text taken from a latin-1 MySQL database and it displays correctly when echoed in PHP. The problem starts when I need to echo some other Danish characters, which are not taken from the database.
What I do is actually output the header
Content-Type: text/html; charset=iso-8859-1
so that the non-queried characters display correctly as well.
The problem is, when I do that, the queried characters display incorrectly.
Just because the data is stored in a latin-1 collated table doesn't mean that it's latin-1 encoded. This is due to MySQL not doing any character translation when the connection SET NAMES setting is the same as the collation.
I suspect that you have some UTF8 characters stored in a latin1 database which is confusing the issue.
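A quick way to check what is actually stored (table and column names are placeholders): compare the hex against the single-byte latin1 codes for the Danish letters.
SELECT field, HEX(field) FROM your_table;
-- æ ø å stored as latin1: E6 F8 E5 (one byte each)
-- æ ø å stored as utf8:   C3A6 C3B8 C3A5 (two bytes each; in a latin1 column this is the mismatch suspected above)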
For more help, please can you add details of:
MySQL connection encoding that you have set
Details of where the "non-queried" characters are coming from
Use Unicode. UTF-8 is the right way.
So, set utf8_unicode_ci in the database, UTF-8 as the page charset, and before your query run mysql_query("SET NAMES utf8");
So, I currently have this problem: I have a SQL db dump whose character encoding is latin1, but there are some utf8 chars in the file that look like Ä (should be ā), Ä« (should be ī), Å¡ (should be š), Ä“ (should be ē), etc. How do I convert these letters back to the original utf8?
Character in the file <-> what it should have been <-> bytes in the file
Ä“ <-> ē <-> 5
Ä <-> ā <-> 2
Å¡ <-> š <-> 4
Ä« <-> ī <-> 4
If you're seeing multiple bytes for what should be single characters, chances are it's already in UTF-8. Bear in mind that ISO-8859-1 is a single-byte-per-character encoding, whereas UTF-8 can take multiple bytes - and any non-ASCII character does take multiple bytes.
I suggest you open the file in a UTF-8-aware text editor, and check it there.
The encoding should be set on the connection you use to import the data and on the one you use to read it out. If both of them are set to UTF-8, you will face no problems.
If, however, you import the data with a latin1 connection and later read it out with a UTF-8 one, you're in a world of trouble.
PHP internally only handles latin1; however, that isn't necessarily a problem for you.
If you have already imported the data incorrectly, I think you would see a lot of ? or (diamond + ?) characters in your output.
But basically, when connecting from PHP, make sure to invoke SET NAMES 'utf8' as the first thing you do and see if that works.
If the data is still wrong, you could use PHP's utf8_encode / utf8_decode functions to convert the problematic data.
In a working scenario they should never be used though.
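As a rough illustration of that repair (it only works for characters that exist in latin1, so test on a copy of your data first), a value that has been UTF-8-encoded twice can be knocked back down one level with utf8_decode:
<?php
$broken = 'Ã©';              // what 'é' looks like after being UTF-8-encoded twice
echo utf8_decode($broken);   // é (back to normal UTF-8)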