Working with SET NAMES utf8mb4 with utf8 tables - php

I have a large system running on MySQL 5.5.57 and PHP 5.6.37.
Currently the whole system works in utf8, including SET NAMES utf8 at the beginning of each db connection.
I need to support emojis in one of the tables, so I need to switch it to utf8mb4. I don't want to switch the other tables.
My question is: if I change to SET NAMES utf8mb4 for all connections (utf8 and utf8mb4) and switch only the specific table to utf8mb4 (and only write mb4 data to this table), will the rest of the system work as before?
Can there be any issue from using SET NAMES utf8mb4 with the utf8 tables/data/connections?

I think there should be no problem using SET NAMES utf8mb4 for all connections.
(utf8mb3 is a synonym of utf8 in MySQL; I'll use the former for clarity.)
utf8mb3 is a subset of utf8mb4, so your client's bytes will be happy either way (except for Emoji, which need utf8mb4). When the bytes get to (or come from) a column that is declared only utf8mb3, there will be a check to verify that you are not storing Emoji or certain Chinese characters, but otherwise the data goes through with minimal fuss.
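The characters that a utf8mb3 column cannot hold are the ones above U+FFFF, which need four bytes in UTF-8. A minimal Python sketch of that check follows; the function name fits_in_utf8mb3 is hypothetical and is not part of MySQL or any driver:

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most 3 UTF-8 bytes."""
    # Code points above U+FFFF (outside the Basic Multilingual Plane)
    # need 4 bytes in UTF-8, which a utf8mb3 column cannot store.
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_utf8mb3("Magnüs"))        # accented Latin text
print(fits_in_utf8mb3("thumbs up 👍"))  # contains an emoji
```

A check like this could be run on application data before writing to a table that stays utf8mb3, assuming the application handles text as Unicode strings.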
I suggest
ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4
as the 'right' way to convert a table. However, it converts all varchar/text columns. This may be bad...
If you JOIN a converted table to an unconverted table, you will be comparing a utf8mb3 string to a utf8mb4 string. MySQL will throw up its hands and convert all rows from one to the other. That is, no INDEX will be useful.
So... Be sure to at least be consistent about any columns that are involved in JOINs.

Related

utf8mb4_unicode_ci Selected in PhpMyAdmin but WordPress Tables using utf8mb4_unicode_520_ci Collation

I have selected the utf8mb4_unicode_ci collation (since this was recommended instead of latin..) in phpMyAdmin in both places: Server connection collation under General settings, and the database collation under the Operations tab.
However, the tables in that database, which belong to a WordPress blog, use the utf8mb4_unicode_520_ci collation (visible in the main window, e.g. by clicking on the database).
My question: is it a bad thing, or does it have any negative effect, that I selected utf8mb4_unicode_ci while all of the tables in the WordPress database use utf8mb4_unicode_520_ci?
1) Should I change the options from utf8mb4_unicode_ci to utf8mb4_unicode_520_ci in phpMyAdmin (in both places mentioned above)?
2) Or does it have no bad effect, so I should leave it as it is?
hoping to get answer for this query.
Thank You for reading.
When doing CREATE TABLE ..., the collation comes from one of two places:
- the collation you explicitly state in the CREATE, or
- the database's default collation (from CREATE DATABASE ...).
Similarly, when declaring a column, you can be either explicit or default to the TABLE's settings.
I prefer to be explicit, not letting things default.
There is no harm when the database / table / column disagree on CHARACTER SET and/or COLLATION.
Until you get to MySQL 8.0, utf8mb4_unicode_520_ci is the "best" collation. (Best according to the Unicode standards committee.)

How to convert latin1 table to utf8 with serialized values?

I have to convert some huge tables (>60 GB) from latin1 to utf8 and I'm looking for the best practice. One problem is that some tables contain serialized php objects.
My first approach was to set the TEXT columns to BLOB, convert the character set to utf8 and convert the columns back to TEXT, but I got some issues with the last step (incorrect string value: '\xE4\xF6\xFC\xDF";...').
What would be the best strategy to convert the values properly to utf8?
Given that the data is in latin1 encoding, such as the äöüß in your example, and that the column is CHARACTER SET latin1, see http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases , which says
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
(or utf8)
Note: That will change the charset for all text columns in the one table; and only the one table.
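One thing to watch with the serialized PHP values mentioned in the question: serialize() records each string's length in bytes (the N in s:N:"..."), and converting latin1 text to UTF-8 changes the byte count of every non-ASCII character. A small Python sketch of the mismatch, using the äöüß sample from the question; the serialized fragment here is built by hand purely for illustration:

```python
text = "äöüß"

# Byte length of the same text in each encoding.
latin1_len = len(text.encode("latin-1"))
utf8_len = len(text.encode("utf-8"))

# The fragment PHP's serialize() would have written while the data was latin1.
serialized = f's:{latin1_len}:"{text}";'
print(serialized)
print(latin1_len, utf8_len)
```

After conversion the stored length prefix may no longer match the actual byte length, so unserialize() can fail on such values unless they are re-serialized or the lengths are recalculated.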

What does collation utf8mb4_unicode_ci mean

I was working on a project and wanted to implement a posts table similar to the wordpress posts table to store page content.
So I basically copied the wp_posts table, whose content column is longtext. However, I noticed that under collation it had utf8mb4_unicode_ci.
I'm wondering what this means and what it's necessary for?
utf8mb4_unicode_ci is a collation for the utf8mb4 character set, which supports full Unicode in MySQL databases.
More information can be found here https://mathiasbynens.be/notes/mysql-utf8mb4
Basically, there are many characters in Unicode that can't be stored in a table with utf8, which results in data loss.
Most UTF-8 symbols take one to three bytes, but some take four, and these weren't supported by utf8 (hence the move from utf8 to utf8mb4).
In WordPress, this change from the utf8 collation caused problems for some users, mostly because utf8mb4_unicode_ci is supported only in MySQL 5.5.3+.
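The one-to-four byte range mentioned above can be seen directly by encoding a few sample characters. A short Python sketch; the helper name utf8_len is made up for this example:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# ASCII letter, accented Latin letter, CJK ideograph, emoji.
for ch in ["a", "ü", "日", "😀"]:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

Only the last kind of character, at four bytes, is the kind that MySQL's utf8 cannot store and utf8mb4 can.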

Issue with charset and data

I'll try to explain the whole problem with my poor English:
I used to save data from my application (encoded in utf8) to the database using PHP's default connection (latin1), into tables with latin1 as their charset.
That wasn't a big problem: for example, the string Magnüs was stored as MagnÃ¼s, and when I read the data back I correctly saw the string Magnüs (because of the default latin1 connection).
Now I have changed the connection to use the correct charset, with mysql_query("SET NAMES 'utf8'", $mydb), and I've also changed the charset of my tables' fields, so new values are correctly stored as Magnüs in the DB, and I still see Magnüs when I retrieve the data and print it in my Web Application.
Of course, unfortunately, some old values are now printed badly (Magnüs is printed as MagnÃ¼s).
What I'd like to do is "convert" these old values to the real encoding.
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8; will convert only the field type, not the data.
So, a solution (discovered on internet) should be this:
ALTER TABLE table CHANGE field field BLOB;
ALTER TABLE table CHANGE field field VARCHAR(255) CHARACTER SET utf8;
But these old strings don't change in the database, and so they don't change in the Web Application when I print them either.
Why? And what can I do?
Make sure that your forms are sending UTF-8 encoded text, and that the text in your table is also UTF-8 encoded.
According to the MySQL reference, the last two ALTER statements you mentioned do not change the encoding of the column contents; it's more like a "reinterpretation" of the contents.
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
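The difference between reinterpreting and converting can be shown outside MySQL. The following Python sketch uses the Magnüs example from the question. It assumes Python's latin-1 codec is a close enough stand-in for MySQL's latin1 (which is nearer to cp1252); for the two bytes of ü the result is the same either way.

```python
original = "Magnüs"

# The application sent UTF-8 bytes, but they were treated as latin1.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("latin-1")
print(misread)

# Reinterpretation (the BLOB round trip): keep the stored bytes unchanged
# and relabel them as UTF-8.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)

# Conversion (CONVERT TO) transcodes each misread character instead,
# so the damage is preserved and the value grows.
converted = misread.encode("utf-8")
print(len(utf8_bytes), len(converted))
```

This only repairs values that really are UTF-8 bytes under a latin1 label; rows written after the connection charset was fixed would need to be left out of such a repair.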

Can't get the right characters to display from the database

I'm re-designing a Web site and I have a problem with the existing data base:
The database collation is set to utf8_unicode_ci, but the column I'm querying seems to be set to latin1_swedish_ci. The characters stored in it are Japanese, but even in phpMyAdmin you see other characters (I guess because of the latin1_swedish_ci).
When I print the result from the query I get a bunch of ???. Using
mysql_query('SET NAMES utf8');
mysql_set_charset('utf8',$conn);
instead outputs 2009â€N10ŒŽÂ†2009?N10???2009â€N11ŒŽÂ†2009?N11???
Any ideas?
Because the table was set to use latin1_swedish_ci, it was unable to correctly store the UTF-8 data that was entered. You need to switch that table to use utf8_unicode_ci for data going forward, but any existing data is essentially corrupted. You would have to re-enter the data after switching the collate to get the correct Japanese characters for the existing records.
You need to change the charset to utf8. The collation does not need to be changed to display Japanese characters (but to be able to sort and compare texts it might be a good idea to change it to utf8_general_ci).
Hi all, thanks for your replies. This is what happened: I couldn't really change anything in the DB, since there's another version of the site that still uses that DB and will stay up. So the solution I found was the following:
Case scenario:
The DB is set to use UTF8 (utf8_general_ci), but the fields (at least the ones I needed) were set to latin1_swedish_ci.
Solution:
After mysql_connect I put the following:
mysql_query("SET NAMES 'Shift_JIS'",$conn);
mysql_set_charset('Shift_JIS',$conn);
Then in the PHP file:
$titleJP = $row['titleJP'];
$titleJP = mb_convert_encoding($titleJP, "UTF-8", mb_detect_encoding($titleJP,"Shift_JIS,JIS,SJIS,eucjp-win"));
That worked perfectly: the characters are now displayed in correct Japanese.
I had tried every other solution I could think of with no luck (PHP's utf8_decode/utf8_encode functions, etc.).
