I'm using a cryptographic function in PHP (mcrypt_create_iv). I noticed that in my database table, the field which stores this function's return value uses the latin1_swedish_ci collation, while in CodeIgniter (config/database.php) the charset is set to utf8.
I tested keeping the charset as utf8 in CI and running the method which stores the encrypted data in the table's column, but it returned a bunch of question marks and other output that didn't make me feel confident that the mcrypt function worked.
So I changed CI's database charset to latin1, which matches the field in my database table. My DB config file now looks like:
$db['default']['char_set'] = 'latin1';
$db['default']['dbcollat'] = 'utf8_general_ci';
I was wondering if there would be any problem caused by using both latin1 and utf8? I can feel that it just doesn't look right, using two different charsets and all, but in order to use the mcrypt_create_iv function (which is used to salt passwords, a big deal imo), I resorted to doing it anyway, hoping it wouldn't affect anything (i.e. inserting/getting data back correctly).
Could someone please shed some light, I would really appreciate it. Thanks
Using the latin1 charset with a utf8 collation doesn't make a lot of sense. The latin1 charset will turn most Unicode characters into "?" since they don't exist in that charset. A collation based on characters that are not in your chosen charset won't do anything.
So: if you want to be able to store all textual data, you'll want to change your charset to utf8 and use the utf8_general_ci collation. If you just want latin1 exclusively (I don't know why you would, but you might...) then use a latin1 collation as well.
If you do go with utf8, remember that when you set up a connection to your database, the connection must also use utf8 for its charset and names, so that you don't lose text "in transport" between your server and your database.
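As a rough sketch of the utf8 option, the two CodeIgniter settings from the question would both name utf8. This assumes the table and column themselves have also been converted to utf8; the array keys are the ones shown in the question.

```php
<?php
// config/database.php (CodeIgniter): charset and collation now agree.
// Assumption: the tables and columns are utf8 as well.
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
```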
Related
Just now, I ran into a problem that I by mere chance had not encountered before:
In order to support emoji in specific columns, I had decided to set my mysqli_set_charset() to utf8_mb4 and to change the encoding of a few columns in my database as well.
Then I ran into the problem of PHP not correctly handling accented characters coming from normal utf8-encoded fields.
Now I'm stuck with mixed utf8 and utf8mb4 results. Since my data handling is not very strong (I used to work with frameworks that handled it all for me), I'm quite unfamiliar with how I could best resolve this.
I have thought about the following options:
1 ) set my entire database to utf8mb4 collation instead of utf8 with a few exceptions.
2 ) use mysqli_set_charset() to change it, and simply make sure the queries getting said data are separated
Now, neither of these seem like great ideas to me, but I can't really think of any better solution.
So then there's the remaining questions:
Will setting my entire db to utf8mb4 instead of utf8 be a big performance change? I do realise that utf8mb4 is bigger and therefore slower, which is why I tried to only use it on the columns in question in the first place.
Is there a way for me to simply have PHP handle utf8 encoding well, even when the mysqli charset is set to utf8mb4?
Do you have a better idea?
I'm at a real loss on this subject and I honestly can't guess which option is best. Googling on it didn't help too much as it only returned links explaining the differences of it or on how to convert your database to utf8mb4, so I would very much love to hear the thoughts on this of one of the wise SO colleagues!
Columns in this specific case:
My response including PHP's character encoding detection:
arri�n = UTF-8
bolsward = ASCII
go�nga = UTF-8
lo�nga = UTF-8
echt = ASCII
echteld = ASCII
echten (drenthe) = ASCII
echten (friesland) = ASCII
echtenerbrug = ASCII
echterbosch = ASCII
My MYSQLI charset:
mysqli_set_charset($this->getConn(), "utf8mb4");
-- and I just realised the problem was with my mysqli_set_charset. There indeed used to be an underscore in there...
It is spelled utf8mb4 (no underscore).
See "Trouble with utf8 characters; what I see is not what I stored".
In particular, read "Overview of what you should do" in the answer.
You do not need to change the entire db. It is fine to specify utf8mb4 for only selected columns.
You do need to use utf8mb4 for the connection. Outside MySQL (in HTML, for example) you specify 'UTF-8', which is the outside world's equivalent of MySQL's utf8mb4. MySQL's utf8 is a subset of utf8mb4. (Note: I am being precise in use of hyphens and underscores.)
utf8mb4 is not bigger, nor slower for transferring characters that are in common between utf8mb4 and the utf8 subset. Emoji are 4 bytes, so they are bigger than most other characters, but you are stuck with them being 4 bytes; don't sweat it.
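To make the size point concrete, here is a small sketch that counts the UTF-8 bytes of a few characters. The sample characters and labels are my own choice for illustration; characters of up to 3 bytes are encoded identically in MySQL's utf8 and utf8mb4, and only the 4-byte ones require utf8mb4.

```php
<?php
// strlen() counts bytes, not characters, so it shows the encoded size.
$samples = [
    'ascii a'  => "a",                // U+0061
    'e-acute'  => "\xC3\xA9",         // U+00E9
    'euro'     => "\xE2\x82\xAC",     // U+20AC
    'emoji'    => "\xF0\x9F\x98\x80", // U+1F600, needs utf8mb4 in MySQL
];

$byteLengths = [];
foreach ($samples as $label => $bytes) {
    $byteLengths[$label] = strlen($bytes);
    echo $label, ': ', $byteLengths[$label], " byte(s)\n";
}
```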
Unfortunately all my databases (and collations) are utf8_general_ci, and I only recently learned that it is better to use utf8_unicode_ci as it sorts and compares Unicode characters more accurately.
Will there be any issues if I use phpMyAdmin to change the collations and database table charsets through its menus?
Also, as I didn't know the importance of charsets, I have not been setting the MySQLi charset for my database connections in PHP. Should I go through and do mysqli->set_charset("utf8") for all my connections? It is currently "latin1" by default. I assume this could be an issue, as I am storing data as UTF8 but accepting latin1? (I am, however, declaring UTF-8 on my HTML pages with <meta charset="utf-8">.)
I also read it might be better to go straight to utf8mb4? Again, would I have any issues changing that with phpMyAdmin, and is it worth it? If I do go with utf8mb4, do I have to do mysqli->set_charset('utf8mb4')?
Thanks! I really should have done this from the start.
CHARACTER SET is the encoding of the bytes. COLLATION is how characters are compared (for WHERE and ORDER BY).
You cannot trivially change either of those after the table is built. Instead you need to do some form of ALTER, probably ALTER ... CONVERT TO ....
The character set utf8mb4 has the advantage of handling all of Chinese (utf8 is missing some characters) and Emoji (the newer smileys).
The collation utf8_unicode_520_ci (or utf8mb4_unicode_520_ci for character set utf8mb4) is based on a newer Unicode standard, so it is arguably the 'best' available in MySQL.
So, yes,
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
mysqli->set_charset('utf8mb4'); right after connecting.
In HTML, have <meta charset=UTF-8>
There is a chance that the CONVERT TO will come across "duplicate keys", since the _unicode_ collations work differently than *_general_ci. That won't happen for English, and won't happen for most of Europe. Two exceptions come to mind: the German ß in some UNIQUE or PRIMARY column, and any accented letters that are 'composed' of a 'non-spacing' accent together with a letter. (The latter is very rare.)
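The connection step above might look roughly like this in PHP. This is only a sketch: the helper name open_utf8mb4_connection and its parameters are hypothetical, and the credentials you pass are your own.

```php
<?php
// Hypothetical helper: the function name and parameters are placeholders.
function open_utf8mb4_connection(string $host, string $user, string $pass, string $name): mysqli
{
    $conn = new mysqli($host, $user, $pass, $name);
    if ($conn->connect_error) {
        throw new RuntimeException('Connect failed: ' . $conn->connect_error);
    }

    // Set the connection charset right after connecting (note: no underscore).
    if (!$conn->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }

    return $conn;
}
```

Every query sent through the returned connection is then exchanged as utf8mb4, which matches a <meta charset=UTF-8> page.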
I have a MySQL database with its collation set to latin1_swedish_ci, but my site uses the windows-1256 encoding. This means the data inside the tables is encoded as windows-1256.
What is the correct way to convert my database tables/fields and data to utf-8 using iconv or any other library?
First, you need to verify that the data in the table(s) is really latin1. Could you do SELECT HEX(col), col ... to see what it looks like?
Whether it is latin1 encoding or utf8 encoding (or something else) determines which steps to perform. (If you do these steps without knowing, you could make things worse.)
These references give you the next steps:
http://dev.mysql.com/doc/refman/5.0/en/alter-table.html and/or
http://mysql.rjweb.org/doc.php/charcoll
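As a reference for reading the HEX() output mentioned above, this sketch prints the bytes of one accented letter in three states. It uses "é" purely as an example character, needs the mbstring extension, and uses ISO-8859-1 as a stand-in for MySQL's latin1 (the two differ slightly in the 0x80-0x9F range).

```php
<?php
// Requires the mbstring extension.
$utf8   = "\xC3\xA9";                                        // "é" stored as UTF-8
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // "é" stored as latin1
// UTF-8 bytes misread as latin1 and encoded to UTF-8 again ("double encoding"):
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo strtoupper(bin2hex($latin1)), "\n"; // E9
echo strtoupper(bin2hex($utf8)), "\n";   // C3A9
echo strtoupper(bin2hex($double)), "\n"; // C383C2A9
```

If HEX(col) shows a single byte per accented letter, the data is in a single-byte encoding; two bytes suggest UTF-8; four bytes suggest double encoding.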
According to the official MySQL manual the collation used defines the order of records when sorting alphabetically:
http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
However: I have a PHP script (UTF-8), and when I save some foreign characters in my MySQL database they are saved all weird (first row). This is when the collation I choose is latin1_swedish_ci. When I change the collation to utf8_unicode_ci all is good (second row).
When saving this data everything is exactly the same except for the collation.
So how about that "collation is used solely for sorting records"?
Hope someone can clarify this for me :-) Thanks in advance!
It appears that the charset of your connection is not set correctly, so the conversion from the programming language's charset to the database's is not correct.
You should set the charset in your connection; then both will work fine.
As pointed out in the comments, here is a little explanation of how things work.
When you have not set the character set in your connection, the server assumes it to be the same as the character set of the database. When data is received in another encoding, the data is written nevertheless, just with wrong or different characters than it had in the encoding of the data from the script.
As long as nothing changes, the script gets back the same data it has written and everything appears to be fine.
However, when either the connection encoding or the database encoding is changed at this point, the already stored data gets converted to the new encoding. The problem here is that the source data is not in the encoding that is assumed when converting.
All of these encodings share the ASCII set with the same bits; that's why ASCII characters don't get messed up. Only special characters do.
So you have to set your connection encoding in order not to produce the mess that you are already in.
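The mechanism described above can be simulated without a database. This is only a sketch under assumptions: a UTF-8 script, a connection left at latin1, and a utf8 column, with ISO-8859-1 standing in for MySQL's latin1 and mbstring doing the conversions the server would do.

```php
<?php
// Requires the mbstring extension.
$sent = "\xC3\xA9"; // the script sends "é" as UTF-8 bytes

// The server believes the client speaks latin1, so it treats each byte
// as one latin1 character and converts it to utf8 for storage.
$stored = mb_convert_encoding($sent, 'UTF-8', 'ISO-8859-1');

// Reading back over the same latin1 connection reverses that conversion,
// so the script sees its original bytes and everything appears fine.
$readOverLatin1 = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');

// Once the connection is switched to utf8, the stored bytes arrive as-is
// and the garbling becomes visible.
$readOverUtf8 = $stored;

var_dump($readOverLatin1 === $sent); // bool(true)
echo bin2hex($readOverUtf8), "\n";   // c383c2a9
```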
Now, what can you do about the data you already have?
You can make a dump of your database using mysqldump with the --skip-set-charset option. Then you get a plain-text file. In this plain-text file, replace all occurrences of the actual database charset with the one the data is really in (the one you had in your script when you wrote the data).
Then save the file and make sure your editor does not do any conversion (I recommend vim).
Then import that file and you will get a database with data in the correct encoding. After that you can change the encoding however you like, and as long as your connection charset is set as well, you will be fine from now on.
Also make sure that the MySQL server has the charsets installed, but it should have that already.
This is only my approach; I have cleaned up a lot of messed-up installations like that, most of which at some point had garbled characters in their projects (after switching servers, updating, or restoring a backup...).
It turns out that setting the connection charset is something that is very often forgotten.
I have a database filled with values like ♥•â—♥ Dhaka ♥•â—♥ (which should be ♥•●♥ Dhaka ♥•●♥), as I didn't specify the collation while creating the database.
Now I want to fix it. I cannot fetch the data again from where I got it in the first place, so I was wondering if it might be possible to fetch the data in a PHP script and convert it to the correct characters.
I've changed the collation of the database and the fields to utf8_general_ci.
The collation is NOT the same as the character set. The collation is only used for sorting and comparison of text (that's why there's a language term in there). The actual character set may be different.
The most common failure is not in the database but rather in the connection between PHP and MySQL. The default charset for the connection is usually ISO-8859-1. You need to change that as the first thing you do after connecting, using either the SQL query SET NAMES 'utf8'; (MySQL's name has no hyphen) or the mysql_set_charset function.
Also check the character set of your tables. This may be wrong as well if you have not specified UTF-8 to begin with (again: this is not the same as the collation). But make sure to take a backup before changing anything here. MySQL will try to convert the charset from the previous one, so you may need to reload the data from backup if you have actually saved UTF-8 data in ISO-8859-1 tables.
I would look into mb_detect_encoding() and mb_convert_encoding() and see if they can help you.
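A minimal sketch of the PHP-side repair idea, assuming the damage is classic double encoding through ISO-8859-1. The function name fix_double_encoded is hypothetical. It only repairs characters that exist in ISO-8859-1 and leaves everything else untouched, so data garbled through Windows-1252 (as the sample in the question appears to be) would need more care, and a backup should be taken before writing anything back.

```php
<?php
// Requires the mbstring extension. Hypothetical helper name.
function fix_double_encoded(string $text): string
{
    // Undo one layer: turn each UTF-8 character back into a single byte.
    $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // If characters were lost (replaced by "?"), the text was not purely
    // latin1-range, so leave it alone.
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1');
    if ($roundTrip !== $text) {
        return $text;
    }

    // Only accept the result if the recovered bytes are valid UTF-8;
    // otherwise the input was ordinary, correctly encoded text.
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $text;
    }

    return $candidate;
}

echo bin2hex(fix_double_encoded("\xC3\x83\xC2\xA9")), "\n"; // c3a9
echo fix_double_encoded("Dhaka"), "\n";                     // Dhaka
```

A string that legitimately contains a sequence such as "Ã©" would be changed too, so this is best run only on columns known to be affected.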