I have to convert some huge tables (>60 GB) from latin1 to utf8, and I'm looking for the best practice. One problem is that some tables contain serialized PHP objects.
My first approach was to change the TEXT columns to BLOB, convert the character set to utf8, and change the columns back to TEXT, but the last step failed with Incorrect string value: '\xE4\xF6\xFC\xDF";...'.
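Those latin1 bytes are exactly why the final step fails: they are not valid UTF-8 sequences. A quick Python sketch of the mechanism (Python stands in for MySQL's validation here):

```python
latin1_bytes = "äöüß".encode("latin1")
print(latin1_bytes)  # b'\xe4\xf6\xfc\xdf' -- the bytes from the error message

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    # lone bytes >= 0x80 are not valid UTF-8 sequences,
    # hence MySQL's "Incorrect string value"
    print("not valid UTF-8")
```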
What would be the best strategy to convert the values properly to utf8?
Given that the data really is latin1-encoded (such as the äöüß in your error message) and that the column is CHARACTER SET latin1, see http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases , which says
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
(or utf8)
Note: That will change the charset for all text columns in the one table; and only the one table.
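For intuition, what CONVERT TO does at the byte level can be sketched in Python (a sketch, not MySQL itself): decode the stored bytes with the old charset, then re-encode with the new one.

```python
s = "äöüß"
old_bytes = s.encode("latin1")                           # what the latin1 column holds
new_bytes = old_bytes.decode("latin1").encode("utf-8")   # what CONVERT TO writes back
assert new_bytes == b"\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f"
assert new_bytes.decode("utf-8") == s                    # same characters, new encoding
```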
Related
In a large system based on MySQL 5.5.57 and PHP 5.6.37:
Currently the whole system is working in utf8 including SET NAMES utf8 at the beginning of each db connection.
I need to support emojis in one of the tables so I need to switch it to utf8mb4. I don't want to switch other tables.
My question is: if I change to SET NAMES utf8mb4 for all connections (both the utf8 and the utf8mb4 ones), switch only that specific table to utf8mb4, and write 4-byte data only to that table, will the rest of the system work as before?
Can there be any issue from working with SET NAMES utf8mb4 in the utf8 tables/data/connections?
I think there should be no problem using SET NAMES utf8mb4 for all connections.
(utf8mb3 is a synonym of utf8 in MySQL; I'll use the former for clarity.)
utf8mb3 is a subset of utf8mb4, so your client's bytes will be happy either way (except for Emoji, which need utf8mb4). When the bytes get to (or come from) a column that is declared only utf8mb3, there will be a check to verify that you are not storing Emoji or certain Chinese characters, but otherwise it goes through with minimal fuss.
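The subset relationship is easy to check: every character MySQL's utf8mb3 can store encodes to at most 3 bytes, while Emoji (and some rarer Chinese characters) need 4. A small Python sketch:

```python
bmp_char = "é"    # inside the Basic Multilingual Plane -> fits utf8mb3
emoji = "😀"      # U+1F600, outside the BMP -> needs utf8mb4
assert len(bmp_char.encode("utf-8")) <= 3
assert len(emoji.encode("utf-8")) == 4
```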
I suggest
ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4;
as the 'right' way to convert a table. However, it converts all varchar/text columns. This may be bad...
If you JOIN a converted table to an unconverted table, you will be comparing a utf8mb3 string to a utf8mb4 string. MySQL will throw up its hands and convert all rows from one to the other; that is, no INDEX will be useful.
So... Be sure to at least be consistent about any columns that are involved in JOINs.
I am unable to insert a record into my database table when the data contains something like (£ 2,000).
Can anyone help me with this ?
Thanks
Try encoding the table as:
ALTER TABLE <table_name> CONVERT TO
CHARACTER SET utf8
COLLATE utf8_general_ci;
And make sure the datatype of the column is varchar.
Also, if you plan to output that data on a web page, you can insert HTML entities for currencies into the DB; when you output them, the browser will render the entities as regular characters.
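For instance, producing a numeric HTML entity for £ can be sketched in Python (the xmlcharrefreplace error handler is one way to generate such entities):

```python
price = "£ 2,000"
entity_form = price.encode("ascii", "xmlcharrefreplace").decode("ascii")
assert entity_form == "&#163; 2,000"   # browsers render &#163; as £
```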
It works for me.
Changing the column type and the character set did it: I changed the column type to VARCHAR and the character set to utf8, and it worked.
I currently have the following snippet of text in a paragraph on my website:
let’s get to it
The apostrophe character is part of the UTF-8 charset, and it saves properly in a table column that is designated a VARCHAR column, in the form
let’s get to it
Which is properly parsed by my client. If I put the same text into a TEXT column in MySQL, it's stored as the following:
letâs get to it.
Is there any reason the two would differ, and if so, how can I change it?
letâ€™s is Mojibake. Latin1 is creeping in.
"text blob" -- which is it TEXT or BLOB? They are different datatypes.
letâs comes from htmlentities() or something equivalent. That can be stored and retrieved in VARCHAR, TEXT, or BLOB, regardless of CHARACTER SET. MySQL will not convert to that.
The Mojibake probably came from
The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.
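That double-encoding path can be reproduced in Python. Note that MySQL's latin1 is really Windows cp1252, which is why € and ™ show up in the garbage:

```python
s = "let’s get to it"
# UTF-8 bytes from the client, read back through a latin1 connection:
mojibake = s.encode("utf-8").decode("cp1252")
assert mojibake == "letâ€™s get to it"
# the corruption is reversible as long as the bytes are intact:
assert mojibake.encode("cp1252").decode("utf-8") == s
```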
When I want to add the name of a project in Russian, the application saves the data into the MySQL database in a non-readable format, like проект номер три де фшоыÑшфыво шщфыовÑшщыв (should be: проект номер три де фшоысшфыво шщфыовсшщыв). But when I view the details of the current project, the view form shows the data as it was typed (e.g. проект номер три де фшоысшфыво шщфыовсшщыв).
Since the database is filled with non-utf8 data, the print view of the project has the same problem.
What should I change or delete so that the data is inserted properly?
Ð¿Ñ€Ð¾ÐµÐºÑ is Mojibake for проек.
This is the classic case of
The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.
If you need to fix the data, it takes a "2-step ALTER", something like
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
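The reason the 2-step ALTER works can be sketched in Python: the stored bytes never change, only the label telling MySQL how to interpret them (cp1252 stands in for MySQL's latin1):

```python
raw = "проект".encode("utf-8")                 # utf8 bytes sitting in a latin1 column
assert raw.decode("cp1252") == "Ð¿Ñ€Ð¾ÐµÐºÑ‚"   # what the latin1 label shows
# VARBINARY freezes the bytes; the second ALTER relabels them as utf8:
assert raw.decode("utf-8") == "проект"
```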
I'll try to explain the whole problem (excuse my poor English):
I used to save data from my application (UTF-8 encoded) to the database over PHP's default connection charset (latin1), into tables whose charset is latin1.
That wasn't a big problem: for example, the string Magnüs was stored as MagnÃ¼s, and when I retrieved the data I correctly saw Magnüs (because of the latin1 default connection).
Now I have changed the connection to the correct charset, with mysql_query("SET NAMES 'utf8'", $mydb), and I've also changed the charset of my tables' fields, so a new value is now correctly stored as Magnüs in the DB; I then still see Magnüs when I retrieve it and print it in my web application.
Of course, unfortunately, some old values are now printed badly (Magnüs is printed as MagnÃ¼s).
What I'd like to do is "to convert" these old values with the real encoding.
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8; will convert only the field type, not the data.
So, a solution (discovered on internet) should be this:
ALTER TABLE table CHANGE field field BLOB;
ALTER TABLE table CHANGE field field VARCHAR(255) CHARACTER SET utf8;
But those old strings don't change in the database, so neither does what the web application prints.
Why? And what can I do?
Make sure that your forms are sending UTF-8 encoded text, and that the text in your table is also UTF-8 encoded.
According to the MySQL reference manual, the last two ALTERs you mentioned do not change the encoding of the column contents; it's more like a "reinterpretation" of the contents.
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
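That no-conversion property is easy to see in Python (again with cp1252 standing in for MySQL's latin1): the BLOB step preserves the bytes verbatim, and only the final TEXT ... CHARACTER SET utf8 step changes how they are read.

```python
raw = "Magnüs".encode("utf-8")            # bytes that were mislabeled as latin1
assert raw.decode("cp1252") == "MagnÃ¼s"  # the Mojibake the latin1 label produces
# BLOB <-> TEXT copies bytes verbatim, so relabeling them utf8 restores:
assert raw.decode("utf-8") == "Magnüs"
```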