In a table x, there is a column with the values u and ü.
SELECT * FROM x WHERE column = 'u';
This returns u AND ü, although I am only looking for the u.
The table's collation is utf8mb4_unicode_ci. Wherever I read about similar problems, everyone suggests using this collation, because utf8mb4 supposedly covers ALL CHARACTERS. With this collation, all character set and collation problems should be solved.
I can insert ü, è, é, à, Chinese characters, etc. When I make a SELECT *, they are also retrieved and displayed correctly.
The problem only occurs when I COMPARE two strings, as in the example above (SELECT ... WHERE), or when I use a UNIQUE INDEX on the column. With the UNIQUE INDEX, a "ü" is not inserted when I already have a "u" in the column. So, when SQL compares u and ü in order to decide whether the ü is unique, it considers them the same and doesn't insert the ü.
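For reference, here is a minimal script that reproduces both effects (the table and column names are only illustrative):

CREATE TABLE x (col VARCHAR(10)) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
INSERT INTO x (col) VALUES ('u');

-- returns the 'u' row, even though we searched for 'ü'
SELECT * FROM x WHERE col = 'ü';

-- the second INSERT fails with a duplicate-key error,
-- because 'ü' = 'u' under utf8mb4_unicode_ci
ALTER TABLE x ADD UNIQUE (col);
INSERT INTO x (col) VALUES ('ü');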
I changed everything to utf8mb4 because I don't want to worry about character sets and collation anymore. However, it seems that utf8mb4 isn't the solution either when it comes to COMPARING strings.
I also tried this:
SELECT * FROM x WHERE _utf8mb4 'ü' COLLATE utf8mb4_unicode_ci = column;
This code is executable (looks pretty sophisticated). However, it also returns ü AND u.
I have talked to some people in India and here in China about this issue. We haven't found a solution yet.
If anyone could solve the mystery, it would be really great.
Add_On: After reading all the answers and comments below, here is a code sample which solves the problem:
SELECT * FROM x WHERE 'ü' COLLATE utf8mb4_bin = column
By adding "COLLATE utf8mb4_bin" to the SELECT query, SQL is invited to put the "binary glasses" (ending _bin) on when it looks at the characters in the column. With the binary glasses on, SQL sees now the binary code in the column. And the binary code is different for every letter and character and emoji which one can think of. So, SQL can now also see the difference between u and ü. Therefore, now it only returns the ü when the SELECT query looks for the ü and doesn't also return the u.
In this way, one can leave everything (database collation, table collation) the same, but only add "COLLATE utf8mb4_bin" to a query when exact differentiation is needed.
(Actually, SQL takes all the other glasses off (utf8mb4_german_ci, _general_ci, _unicode_ci, etc.) and does only what it does when it is not forced to do anything extra: it simply looks at the binary code and doesn't adjust its search to any particular cultural background.)
Thanks everybody for the support, especially to Pred.
Collation and character set are two different things.
Character set is just an 'unordered' list of characters and their representation.
utf8mb4 is a character set and covers a lot of characters.
Collation defines the order of characters (it determines the end result of ORDER BY, for example) and defines other rules (such as which characters or character combinations should be treated as the same). Collations are derived from character sets; there can be more than one collation for the same character set. (It is an extension to the character set - sorta.)
In utf8mb4_unicode_ci, all (most?) accented characters are treated as the same character; this is why you get both u and ü. In short, this collation is accent-insensitive.
This is similar to the fact that German collations treat ss and ß as same.
utf8mb4_bin is another collation and it treats all characters as different ones. You may or may not want to use it as default, this is up to you and your business rules.
You can also convert the collation in queries, but be aware that doing so will prevent MySQL from using indexes.
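For example (a sketch using the illustrative table from the question, assuming an index exists on the column; EXPLAIN output varies by version and data):

-- can use an index on col:
EXPLAIN SELECT * FROM x WHERE col = 'u';

-- the per-query collation change generally forces a full scan instead:
EXPLAIN SELECT * FROM x WHERE col COLLATE utf8mb4_bin = 'u';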
Here is an example using a similar, but maybe a bit more familiar part of collations:
The ci at the end of a collation name means Case Insensitive, and almost all collations ending with ci have a counterpart ending with cs, meaning Case Sensitive.
When your column is case insensitive, the where condition column = 'foo' will find all of these: foo, foO, fOo, fOO, Foo, FoO, FOo, FOO.
Now if you try to set the collation to case sensitive (utf8mb4_unicode_cs for example), all the above values are treated as different values.
The localized collations (German, UK, US, Hungarian, whatever) follow the rules of the named language. In German, ss and ß are the same, and this is stated in the rules of the German language. When a German user searches for the value Straße, they will expect that software (supporting the German language or written in Germany) will return both Straße and Strasse.
To go further: when it comes to ordering, the two words are the same; they are equal, their meaning is the same, so there is no particular order between them.
Don't forget that the UNIQUE constraint is just a way of ordering/filtering values. So if there is a unique key defined on a column with a German collation, it will not allow inserting both Straße and Strasse, since by the rules of the language they are treated as equal.
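A minimal sketch (utf8mb4_german2_ci is MySQL's German collation that treats ß and ss as equal; the names are illustrative):

CREATE TABLE words (
    w VARCHAR(30) CHARACTER SET utf8mb4 COLLATE utf8mb4_german2_ci,
    UNIQUE KEY (w)
);
INSERT INTO words (w) VALUES ('Straße');   -- OK
INSERT INTO words (w) VALUES ('Strasse');  -- duplicate-key error: equal to 'Straße' under this collation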
Now let's look at our original collation, utf8mb4_unicode_ci. This is a 'universal' collation, which means that it tries to simplify everything. Since ü is not a really common character and most users have no idea how to type it, this collation makes it equal to u. This is a simplification in order to support most languages, but as you already know, this kind of simplification has side effects (in ordering, filtering, unique constraints, etc.).
The utf8mb4_bin collation is the other end of the spectrum. It is designed to be as strict as it can be. To achieve this, it literally uses the character codes to distinguish characters. This means each and every form of a character is different; this collation is implicitly case sensitive and accent sensitive.
Both ends of the spectrum have drawbacks: the localized and general collations are designed for one specific language or to provide a common solution. (utf8mb4_unicode_ci is the utf8mb4 counterpart of the old utf8_unicode_ci collation.)
The binary collation requires extra caution when it comes to user interaction. Since it is CS and AS, it can confuse users who are used to getting the value 'Foo' when they search for the value 'foo'. As a developer, you also have to be extra cautious with joins and other features: an INNER JOIN ... ON 'foo' = 'Foo' will return nothing, since 'foo' is not equal to 'Foo'.
I hope that these examples and explanation helps a bit.
utf8_collations.html lists which letters are 'equal' in the various utf8 (or utf8mb4) collations. With rare exceptions, all accents are stripped before comparing in any ..._ci collation. Some of the exceptions are language-specific, not Unicode in general. Example: in Icelandic, É > E.
..._bin is the only collation that treats accented letters as different. Ditto for case folding.
If you are doing a lot of comparing, you should change the collation of the column itself to ..._bin; when you add a COLLATE clause in WHERE instead, an index cannot be used.
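A sketch of the column change (illustrative table and column; note that MODIFY must repeat the full column definition):

ALTER TABLE x MODIFY col VARCHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;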
A note on ß: ss = ß in virtually all collations. However, utf8_general_ci (which used to be the default) treats them as unequal; that one collation made no effort to treat any 2-letter combination (ss) as a single 'letter'. Also, due to a mistake in 5.0, utf8_general_mysql500_ci treats them as unequal.
Going forward, utf8mb4_unicode_520_ci is the best through version 5.7. For 8.0, utf8mb4_0900_ai_ci is 'better'. The "520" and "900" refer to Unicode standards, so there may be even newer ones in the future.
You can try the utf8_bin collation, and you shouldn't face this issue, but it will be case sensitive. The _bin collations compare strictly: the characters are separated out according to the selected encoding, and once that's done, comparisons are done on a binary basis, much like many programming languages compare strings.
I'll just add to the other answers that a _bin collation has its peculiarities as well.
For example, after the following:
CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');
this will fail:
INSERT INTO `dummy` (`key`) VALUES ('one ');
This is described in The binary Collation Compared to _bin Collations.
Edit: I've posted a related question here.
I have a database table with a column where I categorized Persian alphabetic letters to select with MySQL WHERE later. Everything works fine for all letters, but I have a problem while selecting the letter (چ), which is stored as (Ú†) in the database, and (ن), which is stored as (Ù†).
At first I thought the problem could come from inserting the same letters, but when I checked the database, the letters were stored with different encodings, I mean (Ù†) and (Ú†).
When I zoom in on these letters, the mark over the U is different. Both letters are echoed correctly when I echo them on the webpage, but when I select letters WHERE letter = 'چ', it shows letters with (ن) too!
All of the webpages that insert and read data from the DB are in UTF-8, and the database collation is utf8_persian_ci.
I can't find where the problem is. Any help is appreciated.
Mojibake. (or not; see below) Probably:
The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.
For PHP:
⚈ mysqli interface: mysqli_set_charset('utf8') function.
⚈ PDO interface: set the charset attribute of the PDO dsn or via SET NAMES utf8.
The COLLATION (eg, utf8_persian_ci) is not relevant to Mojibake. It is relevant to how characters are ordered.
Edit
You say "is stored as (Ù†)" -- How do you know? Most attempts to see what is stored are subject to the client fiddling with the bytes. This is a sure way to see what is there:
SELECT col, HEX(col) FROM tbl ...
For چ, the HEX should be DA86 for proper utf8 (or utf8mb4) encoding. If you get C39AE280A0, then you have "double encoding". In general, Arabic/Persian/Farsi should be of the form Dxyy.
If you read چ while connected with latin1, you will get Ú†, which is DA86 read as latin1 (Ú = DA and † = 86).
ن encodes as D986.
Double Encoding
I used hex(col) to send query and got C399E280A0 for ن and C39AE280A0 for چ .
So, you have "double encoding", not "Mojibake".
C399 is utf8 for Ù; E280A0 is utf8 for †. Your character was changed from latin1 to utf8 twice. Usually the end result is invisible to the outside world, but messed up in the table. That is because the SELECT decodes twice. However, since you are seeing only one decode, things are not that simple.
Caveat: You have a situation where I have not experimented; the advice I give you could be wrong.
Here's what probably happened.
The client had characters encoded as utf8 (good) hex: D986;
When inserting, the application lied by claiming that the client had latin1 encoding. (This is the old default.); D9 converted to Ù and 86 converted to †;
The column in the table declared CHARACTER SET utf8 (good). But now the Ù is stored as C399 and the † is stored as E280A0, for a total of 5 bytes;
When reading the connection claimed utf8 (good) for the client, so those 5 bytes were turned back into Ù†;
The client dutifully said the utf8 data was Ù†.
Notice the imbalance between the INSERT and the SELECT. You tagged this PHP; did PHP both write and read the data? Did it have a different setting for the charset for writing and reading?
The problem seems to be only in setting the charset for writing. It needed to be explicitly utf8, not defaulting to latin1.
But what about the data? If everything I said (about double encoding) matches what you have, then an UPDATE can fix the data. See my blog for the details.
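The commonly cited pattern for that repair looks like the following (illustrative table and column names; test it on a copy of the data first):

-- inner CONVERT re-reads the stored text as latin1, undoing one encoding layer;
-- BINARY keeps the resulting bytes as-is;
-- outer CONVERT interprets those bytes as utf8 again
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8);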
This is a typical result of using a 'locale specific unicode encoding', in your case utf8_persian_ci. I expect that if you switch your collation to utf8_unicode_ci, it will work as expected.
If by any chance you want to get rid of the case-insensitivity, you could switch to utf8_bin.
For further reference see the MySQL documentation.
I am converting a spreadsheet to a database using PHPExcel, and the cell values happen to contain Russian. If I run mb_detect_encoding() I am told the text is UTF-8, and if I set a header of UTF-8 then I see the correct Russian characters.
However, if I compile it into a string (with only addslashes involved in the process) and insert it into the table, I see lots of ????. I have set the table character set to utf8mb4 and the collation to utf8mb4_general_ci. I have also run $this->db->query("SET NAMES 'utf8mb4'"); on my DB connection.
I run PDO query() with my multi-part insert and get the ???s, but if I output the query to the screen I get ÐŸÐ¾Ñ, which would be valid UTF-8. Why would this not be stored correctly in the database?
I have kept this question rather than deleting it so someone may find the answer helpful.
The reason I was struggling was that SQLyog doesn't show you the column charset by default. There is an option labelled "Hide language options" on the Alter Table view; turning it off reveals that when SQLyog creates a table it uses the default server charset, as opposed to what you define the table charset to be. I'm not sure if that's correct - but the solution simply is to turn on the column charset settings and check that they match what you are expecting.
ÐŸÐ¾Ñ is Mojibake for По. Probably:
The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.
The question marks imply...
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
One way to help diagnose the problem(s) is to run
SELECT col, HEX(col) FROM tbl WHERE ...
For По, the hex should be D09FD0BE. Each Cyrillic character, in utf8, is hex D0xx.
We have a very large InnoDB MySQL 5.1 database with all tables using the latin1_swedish_ci collation. We want to convert all of the data which should be in ISO-8859-1 into UTF-8. How effective would changing the collation to utf8_general_ci be, if at all?
Would we be better off writing a script to convert the data and inserting into a new table? Obviously our goal is to minimise the risk of losing any data when re-encoding.
Edit: We do have accented characters, £ symbols, etc.
If the data currently uses only Latin characters and you just want to change the character set and collation to UTF-8 to enable future addition of UTF-8 data, then there should be no problem simply changing the character set and collation. I would do it on a copy of the table first, of course.
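A sketch of that conversion (illustrative table name; CONVERT TO transcodes the existing column data as well as changing the table defaults):

ALTER TABLE mytable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;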
About a week ago I had to do the same task (issues with ö, ä, å)
Created a dump.sql.
Searched and replaced all CHARSET=latin1 with CHARSET=utf8 (in the dump.sql).
Searched and replaced all COLLATE=latin1_swedish_ci with COLLATE=utf8_unicode_ci (in the dump.sql).
Created a new database with the collation utf8_unicode_ci.
Imported the dump.sql.
Altered the database's charset with alter database MY_DB charset=utf8;
and it worked perfectly
Note: after Mike Brant's remark, I think it's better to do the searching and replacing manually for the fields you specifically want. Or you can simply use ALTER for each field, without needing the dump.sql (see the sketch below). It didn't make much difference in my case, as most of my fields needed to be UTF-8 encoded anyway.
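A per-column sketch (illustrative table and column names; MODIFY must repeat the full column definition):

ALTER TABLE my_table MODIFY my_column VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci;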
I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.
In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.
The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin, which compares characters in binary format.
utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional language rules that make them the most accurate for sorting in those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.
You can read more on specific unicode character sets on the MySQL manual - http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.
utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages
However, if you are only using this to store English text, these shouldn't differ.
Be very, very aware of this problem that can occur when using utf8_general_ci.
MySQL will not distinguish between some characters in select statements, when utf8_general_ci collation is used. This can lead to very nasty bugs - especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.
This problem exposes itself at the very least in early 5.x versions - I'm not sure if this behaviour has changed later.
I'm not a DBA, but to avoid this problem, I always go with utf8_bin instead of a case-insensitive one.
The script below describes the problem by example.
-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;
-- next, make sure that your client connection is of the same
-- character/collate type as the one we're going to test next:
SET NAMES utf8 COLLATE utf8_general_ci;
-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');
-- (verify)
SELECT * FROM `test`;
-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get BOTH keys here! MySQL's utf8 collations that are
-- case insensitive (ending with _ci) do not distinguish between
-- the two values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--
-- first, reset the client connection charset/collate type
SET NAMES utf8 COLLATE utf8_bin;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to
-- do the same with the 'latin1' charset:
--
-- first, reset the client connection charset/collate type
SET NAMES latin1 COLLATE latin1_general_ci;
-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected). This shows
-- that the problem with utf8/utf8_general_ci isn't present
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:
-- first, reset the client connection charset/collate type
SET NAMES latin1 COLLATE latin1_bin;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same
-- way (for any sceptics out there):
-- first, reset the client connection charset/collate type
SET NAMES utf8 COLLATE utf8_general_ci;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Two keys.
--
DROP DATABASE sandbox;
It is best to use character set utf8mb4 with the collation utf8mb4_unicode_ci.
The character set utf8 only supports a small fraction of UTF-8 code points, about 6% of the possible characters. utf8 only supports the Basic Multilingual Plane (BMP). There are 16 other planes. Each plane contains 65,536 characters. utf8mb4 supports all 17 planes.
MySQL will truncate data at a 4-byte UTF-8 character inserted into a utf8 column, resulting in corrupted data.
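A quick way to see this (a sketch; with strict SQL mode enabled the INSERT errors out instead of silently truncating):

CREATE TABLE t_utf8 (s VARCHAR(10)) CHARACTER SET utf8;
INSERT INTO t_utf8 VALUES ('abc💩def');  -- 💩 is a 4-byte character
SELECT s FROM t_utf8;                    -- with strict mode off: 'abc', truncated at the 4-byte character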
The utf8mb4 character set was introduced in MySQL 5.5.3 on 2010-03-24.
Some of the required changes to use the new character set are not trivial:
Changes may need to be made in your application database adapter.
Changes will need to be made to my.cnf, including setting the character set, the collation and switching innodb_file_format to Barracuda
SQL CREATE statements may need to include: ROW_FORMAT=DYNAMIC
DYNAMIC is required for indexes on VARCHAR(192) and larger.
NOTE: Switching from Antelope to Barracuda may require restarting the MySQL service more than once. innodb_file_format_max does not change to barracuda until after the MySQL service has been restarted with innodb_file_format = barracuda.
MySQL uses the old Antelope InnoDB file format by default. Barracuda supports dynamic row formats, which you will need if you do not want to hit SQL errors like the following when creating indexes and keys after you switch to the charset utf8mb4:
#1709 - Index column size too large. The maximum column size is 767 bytes.
#1071 - Specified key was too long; max key length is 767 bytes
The following scenario has been tested on MySQL 5.6.17:
By default, MySQL is configured like this:
SHOW VARIABLES;
innodb_large_prefix = OFF
innodb_file_format = Antelope
Stop your MySQL service and add the options to your existing my.cnf:
[client]
default-character-set= utf8mb4
[mysqld]
explicit_defaults_for_timestamp = true
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_format_max = barracuda
innodb_file_per_table = true
# Character collation
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci
Example SQL CREATE statement:
CREATE TABLE Contacts (
id INT AUTO_INCREMENT NOT NULL,
ownerId INT DEFAULT NULL,
created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
contact VARCHAR(640) NOT NULL,
prefix VARCHAR(128) NOT NULL,
first VARCHAR(128) NOT NULL,
middle VARCHAR(128) NOT NULL,
last VARCHAR(128) NOT NULL,
suffix VARCHAR(128) NOT NULL,
notes MEDIUMTEXT NOT NULL,
INDEX IDX_CA367725E05EFD25 (ownerId),
INDEX created (created),
INDEX modified_idx (modified),
INDEX contact_idx (contact),
PRIMARY KEY(id)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
You can see error #1709 generated for INDEX contact_idx (contact) if ROW_FORMAT=DYNAMIC is removed from the CREATE statement.
NOTE: Changing the index to cover only the first 128 characters of contact eliminates the requirement for using Barracuda with ROW_FORMAT=DYNAMIC:
INDEX contact_idx (contact(128)),
Also note: when the field is declared as VARCHAR(128), that is not 128 bytes but 128 characters. You can store 128 four-byte characters or 128 one-byte characters.
This INSERT statement contains the 4-byte 'pile of poo' character in the 2nd and 3rd rows:
INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES
(1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''),
(2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '', ''),
(3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '123💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '', '');
You can see the amount of space used by the last column:
mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`;
+--------------------+---------------------+
| BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) |
+--------------------+---------------------+
| 1024 | 128 | -- All characters are ASCII
| 4096 | 128 | -- All characters are 4 bytes
| 4024 | 128 | -- 3 characters are ASCII, 125 are 4 bytes
+--------------------+---------------------+
In your database adapter, you may want to set the charset and collation for your connection:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
In PHP, this would be set for: \PDO::MYSQL_ATTR_INIT_COMMAND
References:
Mysql 5.6 Reference Manual: Limits on InnoDB Tables
How to support full Unicode in MySQL databases
Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.
Example from the documentation for charset unicode:
utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.
So - it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.
Essentially, it depends on how you think of a string.
I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database should be concerned, a string is still just a string: a number of UTF-8 characters. A character has a binary representation, so why does the database need to know the language you're using? Usually, people construct databases for systems with the scope for multilingual sites. That is the whole point of using UTF-8 as a character set. I'm a bit of a purist, but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language-related rules should be applied at a much higher level than the DBMS.
In my books "value" should never in a million years be equal to "valúe".
If I want to store a text field and do a case-insensitive search, I will use MySQL string functions such as LOWER() together with PHP functions such as strtolower().
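For example, a sketch of such a lookup against a _bin column (illustrative names; note that wrapping the column in a function prevents MySQL from using an index on it):

SELECT * FROM users WHERE LOWER(username) = LOWER('FoO');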
For UTF-8 textual information, you should use utf8_general_ci because...
utf8_bin: compares strings by the binary value of each character in the string
utf8_general_ci: compares strings using general language rules and case-insensitive comparisons
In other words, it should make searching and indexing the data faster, more efficient, and more useful.
The accepted answer fairly definitively suggests using utf8_unicode_ci, and whilst for new projects that's great, I wanted to relate my recent contrary experience just in case it saves anyone some time.
Because utf8_general_ci is the default collation for Unicode in MySQL, if you want to use utf8_unicode_ci then you end up having to specify it in a lot of places.
For example, all client connections not only have a default charset (makes sense to me) but also a default collation (i.e. the collation will always default to utf8_general_ci for unicode).
Likely, if you use utf8_unicode_ci for your fields, your scripts that connect to the database will need to be updated to mention the desired collation explicitly -- otherwise queries using text strings can fail when your connection is using the default collation.
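For example, a connection can name the collation explicitly (a sketch; the same COLLATE clause can also be applied per query):

SET NAMES utf8 COLLATE utf8_unicode_ci;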
The upshot is that when converting an existing system of any size to Unicode/utf8, you may end up being forced to use utf8_general_ci because of the way MySQL handles defaults.
For the case highlighted by Guus, I would strongly suggest using either utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).
If the field is intended to be searched, as opposed to matched exactly for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, and one matches loosely (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language-specific versions, like utf8_german_ci, where the loose matching is more suitable for the language specified.
[Edit - nearly 6 years later]
I no longer recommend the "utf8" character set on MySQL, and instead recommend the "utf8mb4" character set. They match almost entirely, but allow for a little (lot) more unicode characters.
Realistically, MySQL should have updated the "utf8" character set and its collations to match the UTF-8 specification, but instead it created a separate character set with its own collations, so as not to impact storage for those already using its incomplete "utf8" character set.
I found these collation charts helpful: http://collation-charts.org/mysql60/. I'm not sure which chart is the one used for utf8_general_ci, though.
For example here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html
In your database upload file, add the following line before any other line:
SET NAMES utf8;
And your problem should be solved.