MySQL collation and PHP charset conflict - php

I have a bunch of Danish text taken from a latin-1 MySQL database and it displays correctly when echoed in PHP. The problem starts when I need to echo some other Danish characters, which are not taken from the database.
What I do is output the header
Content-Type: text/html; charset=iso-8859-1
so that the non-queried characters display correctly as well.
The problem is, when I do that the queried characters display incorrectly.

Just because the data is stored in a latin1 table doesn't mean that it's latin1 encoded. MySQL does no character translation when the connection character set (the SET NAMES setting) is the same as the column's character set, so whatever bytes the client sends are stored as-is.
I suspect that you have some UTF8 characters stored in a latin1 database which is confusing the issue.
For more help, can you please add details of:
the MySQL connection encoding that you have set
where the "non-queried" characters are coming from

Use Unicode; UTF-8 is the right way.
So set utf8_unicode_ci as the collation in the database, UTF-8 as the page charset, and run mysql_query("SET NAMES UTF8"); before your queries.
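As a rough sketch of the "set the connection charset" step: with PDO the charset can be declared in the DSN instead of sending SET NAMES by hand. The helper name build_mysql_dsn, the host, and the database name are made up for illustration; only the DSN string is built here, and the actual connection is left commented out because it needs a real server.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that declares the connection charset.
// The default 'utf8' mirrors the SET NAMES UTF8 advice above; 'utf8mb4' is MySQL's
// name for full 4-byte UTF-8 and can be passed instead.
function build_mysql_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = build_mysql_dsn('localhost', 'example_db');
echo $dsn, PHP_EOL;

// With a real server, the DSN would be used roughly like this (placeholders, not run here):
// $pdo = new PDO($dsn, 'example_user', 'example_password');
// header('Content-Type: text/html; charset=UTF-8');
```

The point is the same as in the answer: the page charset, the connection charset, and the column charset should all agree.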

Related

UTF8 versus Latin1

I am trying to understand the difference between Latin1 and UTF8, and for the most part I get it. However, when testing I am getting some weird results and could use some help clarifying.
I am testing with 'é' (Latin small letter E with acute) and the link below shows the hex c3a9.
I set up a database and specified utf8 as the character set, then created a table with utf8 as the character set and inserted a record with the character 'é' after setting the connection and client character set to UTF8.
When I do a select hex(field), field from test_table I get:
hex(field), field
C3A9, é
This is fine and consistent with what I read, however, when I do the exact same using a latin1 character set I get the following:
hex(field), field
C3A9, é
But if I enter char(x'E9'), which should be the single-byte Latin1 equivalent value for é, I manage to get it to display correctly using 'set names UTF8', but it doesn't show up correctly when setting the connection and client to Latin1.
Can anyone clarify? Shouldn't Latin1 characters be single byte (hex E9) in both UTF8 and Latin1? Or am I completely misunderstanding it all?
Thanks
latin1 encoding has only 1-byte codes.
The first 128 codes (7 bits, the ASCII range) are identical between latin1 and utf8.
é is beyond the first 128; in latin1 it is a single 8-bit byte, hex E9 (as you observed). In utf8 it takes 2 bytes: C3A9. Most Asian characters take 3 bytes in utf8; latin1 cannot represent those characters at all.
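The byte counts can be checked from PHP without a database. This is a small sketch that assumes the mbstring extension is loaded. The characters are written as explicit byte escapes so the result does not depend on how the script file itself is saved, and ISO-8859-1 stands in for MySQL's latin1.

```php
<?php
// 'é' as explicit UTF-8 bytes.
$utf8 = "\xC3\xA9";

// The same character re-encoded as latin1 (ISO-8859-1 used as a stand-in).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo 'utf8:   ', strtoupper(bin2hex($utf8)), ' in ', strlen($utf8), " bytes\n";
echo 'latin1: ', strtoupper(bin2hex($latin1)), ' in ', strlen($latin1), " byte\n";

// U+4E2D, a CJK character chosen for illustration: one character, three UTF-8 bytes.
$cjk = "\xE4\xB8\xAD";
echo 'cjk:    ', mb_strlen($cjk, 'UTF-8'), ' character in ', strlen($cjk), " bytes\n";
```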
MySQL has the confusing command SET NAMES utf8. That announces that the client's encoding is utf8, and instructs the communication between client and server to convert between the column's CHARACTER SET and utf8 when reading/writing.
If you have SET NAMES latin1 (the old default), but the bytes in the client are encoded utf8, then you are 'lying', and various nasty things happen. But there is no immediate clue that something is wrong.
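One of those nasty things can be simulated in plain PHP. The sketch below assumes a utf8 column, a connection declared as latin1, and a client that actually sends utf8 bytes. mb_convert_encoding (mbstring) plays the role of the server's conversion, with ISO-8859-1 standing in for MySQL's latin1; it is an illustration of the effect, not what the server literally runs.

```php
<?php
$utf8 = "\xC3\xA9"; // 'é' as UTF-8 bytes, sent by the client

// The server was told "these bytes are latin1", so it converts each byte
// to utf8 separately before storing it in the utf8 column.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), PHP_EOL; // four bytes now instead of two

// Reversing that single wrong step recovers the original bytes.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
echo bin2hex($repaired), PHP_EOL;
```

Read back over a utf8 connection, the four stored bytes show up as two junk characters where é should be, and nothing raised an error along the way.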
Checklist for going entirely utf8:
Bytes in client are utf8-encoded
SET NAMES utf8 (or equivalent parameter during connecting to MySQL)
CHARACTER SET utf8 on column or table declaration
<meta ... UTF-8> in html
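For the first checklist item, a quick way to see whether the bytes in the client really are utf8 is mb_check_encoding. The helper name is_utf8 is made up for this sketch, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: true when the byte string is well-formed UTF-8.
function is_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(is_utf8("\xC3\xA9")); // 'é' encoded as utf8
var_dump(is_utf8("\xE9"));     // 'é' encoded as latin1: a lone E9 byte is not valid utf8
var_dump(is_utf8('plain'));    // ASCII is valid in both encodings
```

Note this only catches latin1 bytes posing as utf8. Text that was already double-encoded is still valid utf8 and will pass.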
After recently putting a website through the wringer UTF-8 wise, I think this is a case of viewing UTF-8 data in a latin1 table within a UTF-8 encoded page or terminal.
If you are using a terminal you can check this by looking at the character encoding setting of the terminal (in Ubuntu it's Terminal -> Set Character Encoding). If you are using something like PHPMyAdmin, view the page source and look for the charset of the page, or open up Firebug and look at the response headers for the page, it should say "UTF-8".
If you've inserted the data and it's encoded in UTF-8 and it goes into a latin1 table then the data will still be stored in UTF-8, it's only when you start viewing that data or retrieving that data in a different encoding that you start getting the mangled effect.
I've found it's really crucial that when you are working with character encoding that you get everything the same: the page must have a charset of UTF-8, the upstream into the database must be in UTF-8, the database must have a default charset and storage of UTF-8. As soon as you put a different charset in the mix everything goes crazy.

why do i have to use mb_convert_encoding($name,'ISO-8859-15','utf-8') to get accented chars to display?

The data I'm working with here comes from a page that uses utf8 encoding.
I've set my database and fields to use utf8_general_ci.
Now, for whatever reason, I have to use the following code on the variable in order to have accented characters display correctly in the database:
mb_convert_encoding($name,'ISO-8859-15','utf-8');
This makes no sense to me. Why do I have to convert it to ISO-8859-15 when phpMyAdmin is in utf8, the data is in utf8, and the database and table fields are in utf8?
You most likely have not set your database connection to UTF-8, so your database expects you to send ISO-8859 encoded data. See http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

utf8_encode and accented characters dilemma?

I am facing a paradox when decoding with utf8_encode/utf8_decode. I have a MySQL database with utf8 collation whose fields use the utf8_general coding. My PHP file is in utf8, and in my HTML pages I have specified the utf8 charset in the header.
My problem is that when I select from my table a field containing accented characters (like èçò ùé) and echo that to the browser, I get strange characters.
To resolve my problem, I have to echo $description=utf8_encode($imm['description']).
My question is: why can't I do the echo directly without having to use utf8_encode every time?
I'll just guess that your database connection is not set to UTF-8.
See SET NAMES utf8 in MySQL?
You need to specify the header using PHP to be utf-8. Also make sure that the chars are utf-8 before storing them in the db: utf8_encode encodes an ISO-8859-1 string to UTF-8, so the fact that it helps most likely means that the chars are being stored as ISO-8859-1 in a utf-8 table.
Make sure that you convert those chars to utf-8 before storing them in the db, and then echo should not be a problem at all.
Source: had the exact same problem myself.
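To see why utf8_encode appears to fix the output, here is a small sketch of what it does to the bytes. It assumes PHP is handed the single latin1 byte for é (either because the connection is latin1 or because that is what got stored). mb_convert_encoding from mbstring is used in place of utf8_encode, since utf8_encode is deprecated as of PHP 8.2.

```php
<?php
// Assumed input: 'é' arriving from the database as one latin1 byte.
$fromDb = "\xE9";

// Echoed as-is into a UTF-8 page, this byte is not valid UTF-8.
var_dump(mb_check_encoding($fromDb, 'UTF-8'));

// Same conversion utf8_encode performs: ISO-8859-1 to UTF-8.
$forPage = mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-1');
echo bin2hex($forPage), PHP_EOL; // the two-byte utf8 form of 'é'
```

Once the connection itself is set to utf8, the server hands over the two-byte form directly and the extra conversion is no longer needed.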

MySQL Collation or PHP side to display accented letters properly

What is the best Collation for the column that can allow to store accented letters and parse them out perfectly without any encoding error, because whenever I add an accented letter such as é, å, it shows out with an encoding problem on the PHP side, but in the MySQL side it's fine...
How do I get the accented letters display properly?
You get them correctly by matching the encoding on both ends, i.e. both your PHP output and your DB should use the same encoding. For European languages I would suggest using UTF-8 for both your scripts and the DB. Just remember that you still have to set the connection to UTF-8 in MySQL using SET NAMES 'utf8' COLLATE 'utf8_general_ci' (run this query just after you make a connection to the DB and you should be OK).
Perhaps your problem isn't within the database, but within however you're displaying things from PHP? What content encoding are you specifying in your output? You might need to manually send a header to specify that the content is UTF-8 if that's what you're trying to output.
For instance: header("Content-Type: text/html; charset=UTF-8");

Mysql: latin1-> utf8. Convert characters to their multibyte equivalents

There was a table in latin1 and site in cp1252
I want to have table in utf8 and site in utf-8
I've done:
1) on web page: Content-Type: text/html;charset=utf-8
2) Mysql: ALTER TABLE XXX CONVERT TO CHARACTER SET utf8
This SQL doesn't work as I want: it doesn't convert the ä and ü characters in the database to their multibyte equivalents.
Please help.
Thanks
As this blog post says, using MySQL's ALTER TABLE CONVERT syntax is A Bad Idea [TM]. Export your data, convert the table and then reimport the data, as described in the blog post.
Another idea: have you set your default client connection charset via /etc/my.cnf or mysqli::set_charset?
I've been a fool. SET NAMES was missing.
What I know now:
1) Every time the charset of a column is changed, the actual data is ALWAYS recoded! Change the field to binary to see that.
2) The charset of a column takes priority; the table and db charsets follow. They are used mainly for setting defaults (not 100% sure about the last sentence).
3) SET NAMES is very important. German characters can come in as latin1 and be placed correctly in a utf8 table (recoded by MySQL silently) when you SET NAMES correctly. The server can send data to a web page in the encoding you desire, no matter what the table encoding is. It can be recoded for output.
4) If there is a column in encoding A and a column in encoding B, and you compare them (or use LIKE), MySQL will silently convert them so that it looks like they are in one encoding.
5) MySQL is smart. It never operates on text as bytes unless the type is binary. It always operates on characters! It wants ё in latin1 to equal ё in utf8 if it knows the data encoding.
Since you claim you now get s**t back, it suggests that the characters were modified in the database.
How are you accessing the data in mysql? If you are using a programming interface such as PHP, then you may need to tell that interface what character encoding to expect.
For example, in PHP you will need to call something like mysql_set_charset("utf8"); but it can also be done with an SQL query of SET NAMES utf8
You will then also need to make sure that whatever is displaying the text knows it is utf8 and is rendering with an appropriate encoding. For example, on a web page you would need to set the content type to UTF-8, something like Content-Type: text/html;charset=utf-8
