What does collation utf8mb4_unicode_ci mean - php

I was working on a project and wanted to implement a posts table similar to the WordPress posts table to store page content.
So I basically copied the wp_posts table, which uses longtext. However, I noticed that under collation it had utf8mb4_unicode_ci.
I'm wondering what this means and what it's necessary for.

utf8mb4_unicode_ci is a collation for the utf8mb4 character set, which supports full Unicode in MySQL databases.
More information can be found here https://mathiasbynens.be/notes/mysql-utf8mb4
Basically, there are many Unicode characters that can't be stored in a table using utf8, which results in data loss.
In MySQL's utf8, a character takes one to three bytes, but some characters need four bytes in UTF-8, and those weren't supported. That is the difference between utf8 and utf8mb4.
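To make the byte counts concrete, here is a small Python sketch. Python is used purely for illustration and is not part of the original answer; the helper name utf8_len and the sample characters are my own choices. It prints how many bytes each sample needs in UTF-8; anything that needs four bytes will not fit in a MySQL utf8 column.

```python
# Illustration only: count the UTF-8 bytes each sample character needs.
# utf8_len is a hypothetical helper name, not a MySQL or PHP function.
def utf8_len(ch: str) -> int:
    """Return the number of bytes `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = ["a", "\u00fc", "\u20ac", "\U0001F600"]  # a, ü, €, and an emoji
for ch in samples:
    n = utf8_len(ch)
    charset = "utf8 or utf8mb4" if n <= 3 else "utf8mb4 only"
    print(f"U+{ord(ch):04X} needs {n} byte(s): {charset}")
```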
In WordPress, this change away from utf8 caused problems for some users, mostly because utf8mb4 (and therefore utf8mb4_unicode_ci) is supported only in MySQL 5.5.3+.

Related

Working with SET NAMES utf8mb4 with utf8 tables

This is a large system running on MySQL 5.5.57 and PHP 5.6.37.
Currently the whole system works in utf8, including SET NAMES utf8 at the beginning of each DB connection.
I need to support emojis in one of the tables so I need to switch it to utf8mb4. I don't want to switch other tables.
My question is: if I change to SET NAMES utf8mb4 for all connections (utf8 and utf8mb4) and switch only the specific table to utf8mb4 (and only write mb4 data to this table), will the rest of the system work as before?
Can there be any issue from working with SET NAMES utf8mb4 in the utf8 tables/data/connections?
I think there should be no problem using SET NAMES utf8mb4 for all connections.
(utf8mb3 is a synonym of utf8 in MySQL; I'll use the former for clarity.)
utf8mb3 is a subset of utf8mb4, so your client's bytes will be happy either way (except for Emoji, which need utf8mb4). When the bytes get to (or come from) a column that is declared only utf8mb3, there will be a check to verify that you are not storing Emoji or certain Chinese characters, but otherwise, it goes through with minimal fuss.
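If you want to know in advance whether some text would pass that check, a rough pre-check is to look for code points above U+FFFF, since those are the ones that need four bytes. The sketch below is in Python for illustration only; fits_utf8mb3 is a hypothetical helper name, and this mirrors the idea of the check rather than MySQL's actual implementation.

```python
# Sketch: would this text fit in a utf8mb3 column?
# Assumption: "fits" means every code point is at most U+FFFF, i.e. it
# encodes to no more than three bytes in UTF-8.
def fits_utf8mb3(text: str) -> bool:
    """Return True if no character in `text` needs a 4-byte UTF-8 sequence."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_utf8mb3("h\u00e9llo \u65e5\u672c\u8a9e"))  # accented Latin plus common kanji
print(fits_utf8mb3("ok \U0001F600"))                   # contains an emoji
print(fits_utf8mb3("\U00020000"))                      # a CJK Extension B ideograph
```

Running a check like this over the data headed for the unconverted tables would show whether anything there actually depends on utf8mb4.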
I suggest
ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4
as the 'right' way to convert a table. However, it converts all varchar/text columns. This may be bad...
If you JOIN a converted table to an unconverted table, then you will be trying to compare a utf8mb3 string to a utf8mb4 string. MySQL will throw up its hands and convert all rows from one to the other. That is, no INDEX will be useful.
So... Be sure to at least be consistent about any columns that are involved in JOINs.

utf8mb4_unicode_ci Selected in PhpMyAdmin but WordPress Tables using utf8mb4_unicode_520_ci Collation

I have selected the utf8mb4_unicode_ci collation (since this was recommended to use instead of latin..) in phpMyAdmin in both places: Server connection collation under General settings, and the database collation under the Operations tab.
But the tables in that database, which belong to a WordPress blog, are using the utf8mb4_unicode_520_ci collation (this can be seen in the main window, e.g. by clicking on that database).
My question is: is it a bad thing, or does it have any negative effect, that I have selected utf8mb4_unicode_ci while all of the tables in the WordPress database are using utf8mb4_unicode_520_ci?
1) Should I change the options from utf8mb4_unicode_ci to utf8mb4_unicode_520_ci in phpMyAdmin (in both places mentioned above)?
2) Or does it have no bad effect, so that I should leave it as it is?
hoping to get answer for this query.
Thank You for reading.
When doing CREATE TABLE ..., the collation comes from one of two places:
an explicit collation stated in the CREATE TABLE statement, or
the database's default collation (set by CREATE DATABASE ...).
Similarly, when declaring a column, you can be either explicit or default to the TABLE's settings.
I prefer to be explicit, not letting things default.
There is no harm when the database / table / column disagree on CHARACTER SET and/or COLLATION.
Until you get to MySQL 8.0, utf8mb4_unicode_520_ci is the "best" collation. (Best according to the Unicode standards committee.)

Multi language database encoding in search engine

I have a MySQL database in which I store more than 100,000 keywords, each in different languages. For example, I have three columns: [id] [turkish (utf8_turkish_ci)] [german (utf8)]
Users can enter a German or a Turkish word in the search box. If the user enters a German word, all is fine and the Turkish word is printed, but how do I handle a Turkish search term? I ask because each language has its own additional characters like ä ü ö ş etc.
So should I use
mb_convert_encoding
to convert the string? But then how would I check whether it is a German or a Turkish string? I think that would be too complex. Or is the encoding of the tables wrong?
I'm stuck now, so how should I implement this so that users can enter keywords in both languages?
You have several issues to solve to make this work correctly.
First, you've chosen the utf8 character set to hold all your text. That is a good choice. If this is a new-in-2016 application, you might choose the utf8mb4 character set instead. Once you have chosen a character set your users should be able to read your text.
Second, for the sake of searching and sorting (WHERE and ORDER BY) you need to choose an appropriate collation for each language. For modern German, utf8_general_ci will work tolerably well. utf8_unicode_ci works a little better if you need standard lexical ordering. Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
For modern Spanish, you should use utf8_spanish_ci. That's because in Spanish the N and Ñ characters are not considered the same. I don't know whether the general collation works for Turkish.
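To illustrate why the choice of collation changes search results, here is a rough Python sketch. It is not how MySQL implements collations (MySQL uses per-collation weight tables), and both function names are hypothetical. general_key imitates an accent- and case-insensitive comparison, while spanish_key does the same but keeps ñ as its own letter. Only equality is modelled here, not sort order.

```python
import unicodedata


def general_key(s: str) -> str:
    """Rough stand-in for an accent- and case-insensitive collation."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    # Drop combining marks (accents, tildes, umlauts) after decomposition.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def spanish_key(s: str) -> str:
    """Like general_key, but n-with-tilde stays distinct from plain n."""
    composed = unicodedata.normalize("NFC", s.casefold())
    # "\u00f1" is the precomposed lowercase n with tilde.
    return "".join("\u00f1" if c == "\u00f1" else general_key(c) for c in composed)


word, typed = "\u00d1and\u00fa", "nandu"  # stored word vs. what a user types
print(general_key(word) == general_key(typed))  # True: treated as equal
print(spanish_key(word) == spanish_key(typed))  # False: first letter differs
```

With the general-style key a search for the unaccented word finds the accented one; with the Spanish-style key it does not, which is the behaviour Spanish speakers expect.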
Notice that you seem to have confused the notions of character set and collation in your question. You've mentioned a collation with your Turkish column and a character set with your German column.
You can explicitly specify character set and collation in queries. For example, you can write
WHERE _utf8 'München' COLLATE utf8_unicode_ci = table.name;
In this expression, _utf8 'München' is a character constant, and
constant COLLATE utf8_unicode_ci = table.name
is a query specifier which includes an explicit collation name. Read this: http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html
Third, you may want to assign a default collation to each language specific column. Default collations are baked into indexes, so they'll help accelerate searching.
Fourth, your users will need to use an appropriate input method (keyboard mapping, etc) to present data to your application. Turkish-language users hopefully know how to type Turkish words.

Confusion with utf8_general_ci & utf8_unicode_ci

The MySQL server collation is utf8_general_ci in my.cnf.
I am using the utf8_general_ci collation for the database, and I have now created a few tables with the utf8_unicode_ci collation in the same database.
Now I would like to use utf8_unicode_ci for the server, database, tables and fields. To do that, I first need to change the server collation to utf8_unicode_ci, then the database, tables and fields.
My question is: I already have data in tables stored using utf8_general_ci. Can I just keep it as it is without doing anything to the data, or do I need to do some kind of conversion?
The other thing is, as you can see, the server-level collation is utf8_general_ci but the table and field level is utf8_unicode_ci. With my current setup, which collation does MySQL use when I store and retrieve data from these tables?
Thank you.
"Server level" collation means nothing.
The server-level and database-level charset (and collation) serve merely as default values for database and table creation, respectively.
Say, if you didn't supply any collation when creating a database, it will be created using the server collation. But if you do, the supplied one will be used and the server collation won't interfere at all.
If you didn't supply any collation in the table definition, the table will be created using the database collation. But if you do, the supplied one will be used and neither the server nor the database collation will affect your queries.
It's only table and field level collation that matters.
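The inheritance described above can be modelled in a few lines. This is a simplified Python sketch, not MySQL code; effective_collation and its parameters are hypothetical names. It returns the collation a newly created column ends up with: the most specific explicit setting wins, and anything left out falls back to the next level up.

```python
from typing import Optional


def effective_collation(server: str,
                        database: Optional[str] = None,
                        table: Optional[str] = None,
                        column: Optional[str] = None) -> str:
    """Return the collation a new column gets under the defaulting rules.

    None means "not specified at this level, inherit from the level above".
    """
    for level in (column, table, database):
        if level is not None:
            return level
    return server


# Server default only: everything inherits it.
print(effective_collation("utf8_general_ci"))
# Table created with an explicit collation: server and database are ignored.
print(effective_collation("utf8_general_ci",
                          database="utf8_general_ci",
                          table="utf8_unicode_ci"))
```

Keep in mind that the default is copied when the object is created, so changing the server or database collation later does not alter tables that already exist.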
"if I already have data in tables stored using utf8_general_ci, can I just keep it as it is"
Yes. You can have tables with any charset in your database.

Can't get the right characters to display from the database

I'm redesigning a website and I have a problem with the existing database:
The database collation is set to utf8_unicode_ci, but the table field I'm querying seems to be set to latin1_swedish_ci. The characters stored in it are Japanese, yet even in phpMyAdmin you see other characters (I guess because of the latin1_swedish_ci).
When I print the result of the query I get a bunch of ???. Now, using
mysql_query('SET NAMES utf8');
mysql_set_charset('utf8',$conn);
Will output 2009â€N10ŒŽÂ†2009?N10???2009â€N11ŒŽÂ†2009?N11???
Any ideas?
Because the table was set to use latin1_swedish_ci, it was unable to correctly store the UTF-8 data that was entered. You need to switch that table to use utf8_unicode_ci for data going forward, but any existing data is essentially corrupted. You would have to re-enter the data after switching the collation to get the correct Japanese characters for the existing records.
You need to change the charset to utf8. The collation does not need to be changed to display Japanese characters (but to be able to sort and compare texts it might be a good idea to change it to utf8_general_ci).
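As a rough illustration of why the charset rather than the collation decides what can be displayed, here is a short Python sketch. It is an assumption-laden stand-in: Python's "latin-1" codec is ISO-8859-1, which is close to but not identical to MySQL's latin1, and the sample text is my own. It shows two different failure modes: a forced conversion into a Latin charset, which turns each Japanese character into "?", and UTF-8 bytes that are merely read back as Latin, which look garbled but still contain the original bytes.

```python
text = "\u65e5\u672c\u8a9e"  # three Japanese characters

# Forced conversion: a Latin charset has no slot for these characters,
# so each one degrades to "?" and the original is gone.
lossy = text.encode("latin-1", errors="replace")
print(lossy)  # b'???'

# Mislabelled bytes: the UTF-8 bytes are intact but read as Latin,
# so three characters show up as nine unrelated ones.
misread = text.encode("utf-8").decode("latin-1")
print(len(misread))  # 9

# Because no bytes were lost, reversing the misreading restores the text.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == text)  # True
```

Which of the two cases applies to existing rows determines whether they can be repaired or have to be re-entered.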
Hi all, thanks for your replies. This is what happened: I couldn't really change anything in the DB, since there's another version of the site that still uses that DB and will stay up. So the solution I found was the following:
Case scenario:
The DB is set to use UTF8 (utf8_general_ci), but the fields (at least the ones I needed) were set to latin1_swedish_ci.
Solution:
After mysql_connect I put the following:
mysql_query("SET NAMES 'Shift_JIS'",$conn);
mysql_set_charset('Shift_JIS',$conn);
Then in the PHP file:
$titleJP = $row['titleJP'];
$titleJP = mb_convert_encoding($titleJP, "UTF-8", mb_detect_encoding($titleJP,"Shift_JIS,JIS,SJIS,eucjp-win"));
That worked perfectly: the characters are now displayed in correct Japanese.
I tried every other solution I could think of with no luck (the utf8_decode/utf8_encode PHP functions, etc.).
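For anyone wondering why converting from Shift_JIS did the trick, the Python sketch below reproduces the effect. The sample string is an assumption on my part: it was picked because its Shift_JIS bytes, read as Windows-1252 (roughly what MySQL calls latin1), resemble the garbled output quoted in the question. The last step plays the role of the mb_convert_encoding call above.

```python
original = "2009\u5e7410\u6708"  # "2009" + year kanji + "10" + month kanji

# Bytes as an application writing Shift_JIS would have stored them.
raw = original.encode("shift_jis")

# Reading those bytes as Windows-1252 gives Latin-looking garbage.
garbled = raw.decode("cp1252")
print(garbled)  # 2009”N10ŒŽ

# Decoding the same bytes as Shift_JIS recovers the text, which can then
# be re-encoded as UTF-8 for the page.
fixed = raw.decode("shift_jis")
utf8_bytes = fixed.encode("utf-8")
print(fixed == original)  # True
```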
