I have an application which runs on PHP 5 and accesses and stores a MySQL database using the mysqli extension. The database contains numerous tables with the encoding UTF-8 (collation utf8_swedish_ci).
Unfortunately, it seems that the mysqli connection was configured to encode everything using ISO-8859-1,which means that I've got UTF-8 tables containing latin1 data. I am trying to repair this now, by shifting over everything to UTF-8 (as it should be!)
Is there a built-in way of handling this? If there isn't, how would you recommend I approach this issue?
Edit: A sample of what the data looks like while browsing through it all using PHPMyAdmin:
handelë (should be handelë)
√skal (should be √skal)
Also, the data is output correctly in the HTML document, as long as I use the output encoding UTF-8, but maintain the mysqli connection charset as latin1. It's all rather confusing,.
Very grateful for your help!
All right! So this is what must have happened:
user interface (UTF-8) → controller (UTF-8) → model (ISO-8859-1) → Database (UTF-8, but it receives ISO-8859-1)
So the fields were configured to use the UTF-8 encoding, but they receive ISO-8859-1 encoded data. I wanted to convert the incorrectly encoded data to UTF-8.
Since the data was in fact ISO-8559-1 encoded, I resolved my problem with the following little MySQL "hack":
UPDATE `table` SET `column` = convert(cast(convert(`column` using latin1) as binary) using utf8)
Courtesy ABS on StackOverflow.
Thank you for your time looking into my problem, guys! :)
Related
According to the official MySQL manual the collation used defines the order of records when sorting alphabetically:
http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
However: I have a PHP script (UTF-8) and I save some foreign characters in my MySQL database it's saved all weird (first row). This is when the collation I choose is latin1_swedish_ci. When I change the collation to utf8_unicode_ci all is good (second row).
When saving this data everything is exactly the same except for the collation.
So how about that "collation is used solely for sorting records"?
How someone can clarify this for me :-) Thanks in advance!
It appears that the charset of your connection is not set right, therefore the conversion from the programming language charset to the database is not correct.
You should set the charset in your connection, then both will workfine.
as pointed out in the comments a little explanation on how things work.
when you have not set the character set in your connections, the server assumes it to be the same as the collocation of the database. when data is recieved in a another encoding, the data is written nevertheless. just with wrong or other characters than they have been in the encoding of the data from the script.
as long as nothing changes, the script gets back the same data as it has written and everything appears to be fine.
however when either the connection encoding or the database encoding is changed at this point, the already stored data gets converted to the new encoding. the problem here is that the source data is not in the encoding that is assumend when converting.
all encodings share the ascii set with the same bits, thats why ascii charactes dont mess up. only special charaters do.
so you have to set your conneciton encoding in order to dont produce the mess that you are already in.
now what can you do about the data you already have?
you can make a dump of your database using mysqldump and use the --skip-set-charset option. then you get a plaintext file. in this plane text file replace all occurences of the actual database charset with the one the data is really in (the one you had in your script when you wrote the data).
then save the file and make sure your editor does not do any conversion (i recommend vim).
then import that file and you will get a database with data in the correct encoding. then you can change the encoding however you like and as long as your conneciton charset gets set also you will be fine from now on.
also make sure that the mysql server has the charsets installed, but it should have that already.
this is only my approach, i have cleaned up a lot of messed up installations like that. most of which at some point have garbled characters in their projects (after switching server, updating or restoring a backup...).
turns out not setting the connection charset is something that is very often forgotten.
I have a Mysql database with all tables collated as 'utf8_unicode_ci'.
Also all data I wrote to the Database with php was encoded in utf8.
But I forgot to set the mysql connection encoding to utf8, so it probably defaulted to ISO-8859.
For a long time this was not a problem. Although special characters where displayed wrong in Tools like phpMyAdmin, the data was correct when loading it into my php application, as long as I kept using the wrong connection encoding.
But now I need to use my database from another application, that (correctly) does not use ISO-8859 as connection encoding and gets broken special characters.
Now I want to convert my database so I can use the right connection encoding.
I already tried this:
mysql wrong connection encoding
But I does not help for me. The closest I got to a solution was 'ut8_decode(utf8_decode($data))'.
But this breaks fields that start with a special character.
Additional Information:
So what might happen is the following:
My application sends some utf8 Data to the database.
Mysql gets the data but thinks (due to the connection encoding) that it is not utf8, and converts it, to fit for the 'utf8_unicode_ci' collation.
When my php application reads the data from the database mysql seems to undo the previous conversion so everything looks fine again from my php app.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
UTF-8 all the way through
okay, this is stupid that I can't figure it out.
Mysql database is set to utf8_general_ci collation. The field i'm having problems with is longtext type.
characters added to the database as é or other accented characters are returning as �.
I run the output through stripslashes and i've tried both with and without html_entity_decode but can find no change in the output. What am I doing wrong?
Cheers
What character encoding does the string have that you try to insert? If it is in ISO-8859-1 you can use the PHP function utf8_encode() to encode it to UTF-8 before inserting it into the database.
http://php.net/manual/en/function.utf8-encode.php
Getting encoding right is really tricky - there are too many layers:
Browser
Page
PHP
MySQL
The SQL command "SET CHARSET utf8" from PHP will ensure that the client side (PHP) will get the data in utf8, no matter how they are stored in the database. Of course, they need to be stored correctly first.
DDL definition vs. real data
Encoding defined for a table/column doesn't really mean that the data are in that encoding. If you happened to have a table defined as utf8 but stored as differtent encoding, then MySQL will treat them as utf8 and you're in trouble. Which means you have to fix this first.
What to check
You need to check in what encoding the data flow at each layer.
Check HTTP headers, headers.
Check what's really sent in body of the request.
Don't forget that MySQL has encoding almost everywhere:
Database
Tables
Columns
Server as a whole
Client
Make sure that there's the right one everywhere.
Conversion
If you receive data in e.g. windows-1250, and want to store in utf-8, then use this SQL before storing:
SET NAMES 'cp1250';
If you have data in DB as windows-1250 and want to retreive utf8, use:
SET CHARSET 'utf8';
Last note:
Don't rely on too "smart" tools to show the data. E.g. phpMyAdmin does (was doing when I was using it) encoding really bad. And it goes through all the layers so it's hard to find out. Also, Internet Explorer had really stupid behavior of "guessing" the encoding based on weird rules. Use simple editors where you can switch encoding. Also, I recommend MySQL Workbench.
I have made code that stores utf-8 in a database.
It shows it well in the browser but looks distorted in the database. Since the functionality seems to work and it doesn't look like I have had any problems with processing the string input, is it any point in 'fixing what is not broken' and make utf-8 characters like Japanese show in the database?
I don't search the database since the strings are serialized anyway.
You have to specify the text encoding of the queries, you are sending to MySQL with for instance
SET NAMES `utf8` COLLATE `utf8_unicode_ci`
If you don't, MySQL may interpret your query with the servers default text-encoding that can be different to UTF-8, e.g. iso-latin. So you will have strings in your tables, that are UTF-8 encoded, but MySQL marked them as iso-latin. That won't have much effect on your code, because MySQL just returns your UTF-8 strings back to you and you ignore the text-encoding. If you view the data in phpMyAdmin or any other application, that sets the connections character encoding, you will end up with distorted strings.
You could on the other hand utf8_decode your query strings and utf8_encode the result's provided by MySQL and don't change the connections text encoding from iso-latin. but if you query a different MySQL server that uses UTF-8 as default text encoding, you will end up with the same problem the other way around. so just set the connection's text encoding once after connecting.
What do you use to access the database. If you use a console just the the encoding in the console to utf-8. If you use GUI software just check the options the set the encoding to utf-8. You can try 'set names' to ser the client encoding.
I have been using php + mysql (phpmyadmin) to construct websites with Chinese contents (utf-8) for a long time.
When inputting forms, and also generate output php from db, the Chinese Words display well; but when I look at the database, although sometimes they are normal chinese characters, but something they are not (become strange strings), that made me notice that, the way that mysql handle and input data is not always utf-8.
Some experts on web mentioned, mysql were used to record the input data by latin1; nevertheless, I note that the existing charset in phpmyadmin is utf-8...
Will there be any solid way to detect the encoding format of chinese characters appeared in a phpmyadmin table cell?
Also, apart from mentioning at header of the page, will there be any method so that I can make sure the data entered to the db is utf-8 but not others?
Thank you.
The biggest problem that people encounter in this regard is that they don't tell MySQL that they're sending/expecting UTF-8 encoded data when connecting to the database, so MySQL thinks it's supposed to handle latin1 encoded data and converts it accordingly. Issue the command SET NAMES utf8 after connecting to the db or use mysql_set_charset.
in my case, it just because htmlentities(); Solution is change echo htmlentities($email_db); to echo htmlentities($email_db, ENT_COMPAT, 'UTF-8');