The connection string looks like this:
firebird:dbname=PRODUCTS.GDB;charset=UTF8
But Unicode characters are not returned correctly. I tried changing it to utf-8 (with and without the dash, lower and upper case) and to other charsets such as ISO8859_9. The result is always the same.
The problem is that you are using character set NONE for the columns. For columns with character set NONE all bets are off, as Firebird is unable to transliterate to the specified connection character set and will send the data as is. The handling is then specific to the client application or driver: some will apply the default system encoding, others will just assume the data is in the connection character set they expected (in your case UTF-8), and so on. Doing this may even lead to logical data corruption (e.g. because you are storing it in UTF-8 and another application is retrieving it expecting Windows-1254 or ISO-8859-9).
The fact that it may display correctly in another application may be because that application assumes the stored data is in a certain character set and guesses right.
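The effect is easy to reproduce outside Firebird. The following is a minimal Python sketch (not Firebird or PDO code), assuming the column holds UTF-8 bytes and a second application assumes Windows-1254; the sample word is made up for illustration.

```python
# Bytes as they might sit in a column with character set NONE:
# the server stores them untouched and attaches no encoding to them.
stored = "Türkçe".encode("utf-8")

# An application that guesses right sees the original text.
as_utf8 = stored.decode("utf-8")

# An application that assumes Windows-1254 gets mojibake from the same bytes.
as_win1254 = stored.decode("cp1254")

print(as_utf8)     # Türkçe
print(as_win1254)  # TÃ¼rkÃ§e
```

Both applications read exactly the same bytes; only the assumption about their encoding differs.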
I don't know PHP or PDO, but a workaround might be to specify the actual character set of the data (e.g. WIN1254 instead of UTF8) in the connection string, as this may lead to the characters being converted correctly.
However, the only real solution is to create a new database with a default character set other than NONE, execute the DDL (specifying explicit character sets for columns that need a different one), and then pump the data from the old to the new database, making sure you apply the right character set conversion(s).
When this is done you will also need to ensure that all applications connecting to this database will use an explicit connection character set.
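The conversion step of such a data pump could look roughly like this. It is a Python illustration only: `transcode` is a hypothetical helper, and Windows-1254 as the real encoding of the old data is an assumption you would have to verify for your own database.

```python
def transcode(raw: bytes, actual_charset: str) -> bytes:
    """Re-encode bytes read from a NONE column as UTF-8.

    Decoding is strict on purpose: bytes that are undefined in the
    assumed source charset raise an error instead of being hidden.
    """
    return raw.decode(actual_charset, errors="strict").encode("utf-8")


# Bytes for "ışık" as a Windows-1254 application would have written them.
old_value = "ışık".encode("cp1254")
new_value = transcode(old_value, "cp1254")
print(new_value.decode("utf-8"))  # ışık
```

Note that single-byte charsets accept almost any byte, so a strict decode catches only some wrong guesses. Spot-checking converted rows by eye is still advisable.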
I am working on a website with MySQL database on a Linux server.
Using phpMyAdmin, on the database, it says
MyISAM is the default storage engine on this MySQL server
latin1_swedish_ci
However, I have created all the tables with InnoDB and utf8_unicode_ci. I also checked that the fields of all tables are utf8_unicode_ci.
Yet when I call mysql_fetch_array and echo the result, I get gibberish. I had to explicitly call mysql_set_charset('utf8') for the text to appear correctly.
PHP version is 5.3.9; MySQL version is 5.1.70-cll - MySQL Community Server (GPL).
This is the first time I encountered this problem and I never had to set charset before.
What caused the text fetched by the PHP mysql_* functions to be gibberish? Under what circumstances is it necessary to call mysql_set_charset?
EDIT: This question is not meant to attract suggestions to use an alternative library such as mysqli or PDO. I just want to understand the behavior of MySQL and charsets in this situation. Thanks.
When exchanging data between two systems, there's always the question "what encoding will text be sent in?" "Text" is represented simply as binary data, just long strings of 1s and 0s. These could mean anything at all. There are hundreds of encoding schemes to encode different characters into different sequences of 1 and 0. If a system just receives a string of those without being told what encoding they represent, the system cannot know what characters those supposedly are.
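To make this concrete, here is a small Python sketch (used only for illustration, since the point is independent of PHP or MySQL). It decodes one and the same byte under three encodings; the choice of encodings is arbitrary.

```python
raw = b"\xe9"  # one byte, with no encoding information attached

# Decode the same byte under three different assumptions.
decoded = {charset: raw.decode(charset) for charset in ("latin-1", "cp1251", "iso8859_7")}

for charset, char in decoded.items():
    print(charset, char)
# latin-1 é
# cp1251 й
# iso8859_7 ι
```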
Therefore, for any interface between two systems, there needs to be a specification of what encoding strings are in. For MySQL, that's the API call mysql_set_charset. This is the way to tell MySQL what encoding the strings that PHP sends to it are in, and what encoding MySQL should return strings in to PHP. Without setting this explicitly, some default encoding is assumed, which may not be the encoding you're expecting, creating a mismatch and garbage characters.
Read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App for more information.
It's wise to always call it once the connection is established, to ensure your app is not affected by broken server settings. You can have your tables in, e.g., UTF8 and send your data in UTF8, but if the connection is not UTF8 (because of, e.g., my.ini settings) you end up with a mess. So either call mysql_set_charset() or execute a SET NAMES charset query, and you will be on safe ground. And since it is done once per connection, it costs next to nothing anyway.
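What goes wrong with a mismatched connection can be imitated in a few lines. This is a Python sketch, not MySQL code: `server_store` is a hypothetical stand-in for the server's conversion step, assuming a utf8 column and a script that sends UTF-8.

```python
def server_store(payload: bytes, connection_charset: str) -> bytes:
    """Mimic a server converting client bytes for storage in a utf8 column.

    The server trusts the connection charset: it interprets the payload
    with it and re-encodes the result for the column.
    """
    return payload.decode(connection_charset).encode("utf-8")


sent = "é".encode("utf-8")  # what a UTF-8 script actually sends

ok = server_store(sent, "utf-8")     # connection declared as utf8
bad = server_store(sent, "latin-1")  # connection left at a latin1 default

print(ok)   # b'\xc3\xa9'
print(bad)  # b'\xc3\x83\xc2\xa9'
```

In the second case the two bytes of "é" are treated as two separate latin1 characters and each is encoded again, so the stored value is no longer the text that was sent.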
The mysql_set_charset function sets the default character set for the current connection. Even though your data is stored as Unicode on the server, it still requires a compatible connection character set to transmit the data accurately.
If you execute the SHOW VARIABLES LIKE 'character\_set\_%' statement in mysql, it will show the various character sets used by the server and the current connection. Ideally they should all match and be utf8.
More information: MySQL Connection Character Sets.
According to the official MySQL manual the collation used defines the order of records when sorting alphabetically:
http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
However, I have a PHP script (UTF-8), and when I save some foreign characters in my MySQL database they are stored all weird (first row). This happens when the collation I choose is latin1_swedish_ci. When I change the collation to utf8_unicode_ci, all is good (second row).
When saving this data everything is exactly the same except for the collation.
So how about that "collation is used solely for sorting records"?
Can someone clarify this for me? :-) Thanks in advance!
It appears that the charset of your connection is not set correctly, so the conversion from the charset used by your script to the charset of the database goes wrong.
You should set the charset on your connection; then both collations will work fine.
As pointed out in the comments, here is a little explanation of how things work.
When you have not set the character set on your connection, the server assumes it to be the same as the collation of the database. When data is received in another encoding, it is written nevertheless, just with wrong or different characters than the ones it had in the encoding used by the script.
As long as nothing changes, the script gets back the same data it has written and everything appears to be fine.
However, when either the connection encoding or the database encoding is changed at this point, the already stored data gets converted to the new encoding. The problem is that the source data is not in the encoding that is assumed during the conversion.
The encodings involved here all share the ASCII range with the same bit patterns, which is why ASCII characters don't get messed up. Only special characters do.
So you have to set your connection encoding in order not to produce the mess that you are already in.
Now, what can you do about the data you already have?
You can make a dump of your database using mysqldump with the --skip-set-charset option. This gives you a plain-text file. In this plain-text file, replace all occurrences of the actual database charset with the one the data is really in (the one you had in your script when you wrote the data).
Then save the file, making sure your editor does not do any conversion (I recommend vim).
Then import that file and you will get a database with data in the correct encoding. After that you can change the encoding however you like, and as long as your connection charset is also set, you will be fine from now on.
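If you would rather script the replacement than do it in an editor, the idea can be sketched as follows. This is a Python sketch under assumptions: `relabel_dump` is a hypothetical helper, and it only looks for the `CHARSET=` and `CHARACTER SET` spellings, which may not cover every declaration in your dump.

```python
import re


def relabel_dump(dump: bytes, declared: bytes, actual: bytes) -> bytes:
    """Swap charset declarations in a dump without touching the data bytes.

    Working on bytes avoids any implicit re-encoding by the tool itself.
    """
    pattern = re.compile(rb"(CHARSET=|CHARACTER SET )" + re.escape(declared) + rb"\b")
    return pattern.sub(lambda m: m.group(1) + actual, dump)


sample = (
    b"CREATE TABLE t (name VARCHAR(50) CHARACTER SET latin1) "
    b"DEFAULT CHARSET=latin1;\n"
    b"INSERT INTO t VALUES ('caf\xc3\xa9');\n"
)
fixed = relabel_dump(sample, b"latin1", b"utf8")
print(fixed.decode("utf-8"))
```

A pattern like this would also rewrite row data that happens to contain the same text, so review the result before importing it.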
Also make sure that the MySQL server has the charsets installed, but it should have them already.
This is only my approach; I have cleaned up a lot of messed-up installations like that, most of which at some point had garbled characters in their projects (after switching servers, updating, or restoring a backup...).
It turns out that setting the connection charset is something that is very often forgotten.
I have started debugging my RSS feed because it has some strange characters in it (i.e. the missing-character glyph). I started with two excellent beginner resources:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets: http://www.joelonsoftware.com/articles/Unicode.html
Character Sets / Character Encoding Issues: http://www.phpwact.org/php/i18n/charsets
The reason I believe our RSS feed is having problems is that users are copying and pasting MS Word documents into a textarea on the site, and our PHP pages use the "iso-8859-1" charset, which is incompatible with the special "Windows-1252" encodings for things like bullet points and smart quotes used by MS Word.
So I'm hoping that, to fix the issue, all I need to do is start using "utf-8" in the pages that take or give user input, i.e. set the following in the HEAD section:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
The real reason I'm raising this question though, is because my DB fields that store my user input are in "latin1_swedish_ci" and I want to know whether I NEED to convert them to "utf8_general_ci"? MySQL doesn't really care about the charset, does it? It just sees a bunch of bytes, and if I put Unicode into a field collated as Latin it'll still come back out as Unicode, right? Changing the field will be tiresome because the field is part of a FULLTEXT index whose other fields would also need their collation changed, which means dropping the index and rebuilding it (no small task when large amounts of TEXT are involved).
The real reason I'm raising this question though, is because my DB fields that store my user input are in "latin1_swedish_ci" and I want to know whether I NEED to convert them to "utf8_general_ci"?
No. latin1_swedish_ci and utf8_general_ci are collations - not charsets. The collation won't affect the way that characters are stored or input/output. It only controls how sorting functions order their results. The collation - to work as expected - should match the storage charset. So if your tables are stored in utf8, you should use a utf8 collation.
The storage charset for MySQL is not directly tied to the charset in PHP. You can use utf8 as the storage character set for MySQL while using iso-8859-1 in PHP. In that case, you need to tell MySQL about it by setting the charset on the connection (set names XXX). MySQL will then convert as needed. If you don't use the same charset in MySQL and PHP, you'll end up with the character repertoire of the lowest common denominator, so even though strings are stored in utf8, you won't have the full Unicode range of characters available. Therefore you should use utf8 in both MySQL and PHP.
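The "lowest common denominator" effect can be illustrated without a database. This is a minimal Python sketch, not PHP or MySQL code; `survives` is a hypothetical helper that checks whether text can pass through a given connection charset unchanged.

```python
def survives(text: str, connection_charset: str) -> bool:
    """Return True if text can pass through the connection charset unchanged."""
    try:
        return text.encode(connection_charset).decode(connection_charset) == text
    except UnicodeEncodeError:
        return False  # a character has no representation in that charset


print(survives("café", "latin-1"))    # True
print(survives("日本語", "latin-1"))  # False
print(survives("日本語", "utf-8"))    # True
```

Anything the narrower charset cannot represent is lost on the way, no matter how the table itself stores its data.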
No - definitely not. MySQL has the ability to transform strings from one character set into another on the fly, so what matters is that your MySQL server knows what character set you're working with on the client side (client side = PHP script, NOT the client accessing your webpage). This can be done by issuing the query
SET NAMES 'utf8';
prior to any other query you send to the server. MySQL will then do the appropriate conversions from your client character set into the internal MySQL character set into the table and/or column character set and all the way back. So generally you only have to worry about setting the correct client character set. This character set must match the character set you use to output your data to the webserver.
Please have a look at the MySQL manual:
9.1.4. Connection Character Sets and Collations
or 9.1. Character Set Support in general.
To save someone some time searching for how to change the mysql connection charset nicely with pdo/mysql here's how i do it:
$dbc = new PDO('mysql:dbname=DBNAME;host=DBHOST', $user, $pw, array(PDO::MYSQL_ATTR_INIT_COMMAND => sprintf("SET NAMES %s", $charset)));
In HTTP, the character encoding is declared by the charset parameter in the Content-Type header field of the HTTP response. Other declarations are overridden by the declaration in the HTTP header:
[…] user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
The charset attribute set on an element that designates an external resource.
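The priority order quoted above can be written down as a small function. This is a Python sketch for illustration; `charset_of` and `effective_charset` are hypothetical names, and real user agents apply further fallbacks that are not modelled here.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_charset(http_header, meta_content, element_charset):
    """Pick the first declared charset, highest priority first."""
    for candidate in (charset_of(http_header), charset_of(meta_content), element_charset):
        if candidate:
            return candidate.lower()
    return None


print(effective_charset("text/html; charset=ISO-8859-1", "text/html;charset=utf-8", None))
# iso-8859-1
```

In the usage line the META declaration says utf-8, but the HTTP header wins. This is a common reason why changing only the META tag appears to have no effect.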
Additionally, you should explicitly declare the accepted character encoding with the accept-charset attribute on the form element. Otherwise the user agent may (but is not required to) use the character encoding of the form document to encode the input data:
The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.
This should give you the best chance that the incoming data is encoded correctly, but it's not guaranteed. So it is better to check whether the data is actually encoded as UTF-8 (there are functions/algorithms to do this).
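One such check is simply to attempt a strict decode. It is sketched here in Python for illustration, with `is_valid_utf8` as a hypothetical helper name; in PHP, `mb_check_encoding` serves a similar purpose.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("naïve".encode("utf-8")))    # True
print(is_valid_utf8("naïve".encode("latin-1")))  # False
```

Passing this check is necessary but not sufficient: pure ASCII input, for example, is valid in UTF-8 and in most legacy single-byte encodings alike.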
I have a table whose data is encoded in the latin5 charset, and all the columns in the table are also latin5. When I enter "SET NAMES 'latin5'" in the mysql console and query the table, the results are OK, and when I delete, insert or update, the encoding of the new data is perfect. But when I try to insert ISO-8859 data (I verified this with mb_detect_encoding) from my script, things go wrong: without "SET NAMES", neither insert/update nor select use the proper encoding; with "SET NAMES 'latin5'", selects are OK and the latin5 data comes back properly encoded, but insert/update do not work properly; with "SET NAMES 'utf8'", the results of select queries are badly encoded but insert/update are OK.
I'm asking because we are about to go to production, and this makes me worry about possible future problems.
mb_detect_encoding doesn't know what encoding your string is. It makes a qualified guess, but there are no guarantees that it will guess right. Especially not if the candidates are all single-byte encodings, as in the case of latin1 and latin5.
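The ambiguity is easy to demonstrate. The following Python sketch (for illustration only; the byte values are picked from the few positions where the two encodings differ) decodes the same bytes as latin1 and as latin5. Both decodes succeed, so no detector can tell from the bytes alone which one was meant.

```python
raw = bytes([0xFD, 0xFE, 0xF0])  # valid in both latin1 and latin5

as_latin1 = raw.decode("latin-1")    # ISO-8859-1
as_latin5 = raw.decode("iso8859_9")  # ISO-8859-9, called latin5 in MySQL

print(as_latin1)  # ýþð
print(as_latin5)  # ışğ
```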
There really is no substitute for knowing what you're doing, if you want to get charsets right. I suggest that you read these pages at least a couple of times:
http://www.phpwact.org/php/i18n/charsets
http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet
In particular, note that a web page is served with an HTTP header that specifies the charset the page is encoded in. Unless you explicitly set this from your PHP script, you'll use the web server's default, which may vary from server to server.
Also, make sure you actually understand what is going on, rather than relying on trial and error. The latter can easily get you something that works in some contexts, but not in every context.
And lastly, if you have any choice at all, I seriously suggest that you use utf-8 for everything. latin5 is going to get you lots of grief.
It often happens that characters such as é get transformed to Ã©, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF-8.
I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.
I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.
How do I ensure that user entered data containing international characters doesn't get corrupted?
At first glance, I think one important thing is missing from http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet (perhaps I overlooked it).
Depending on your MySQL installation and/or configuration, you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a
SET NAMES utf8
query prior to any other query you send to the MySQL server.
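If you're using PDO on the PHP side, you can set up the connection to automatically issue this query on every (re)connect by passing it as a driver option
// MYSQL_ATTR_INIT_COMMAND only takes effect when passed to the constructor, not via setAttribute()
$db = new PDO($dsn, $user, $pass, array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));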
when initializing your db connection.
Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, the collation should be a utf-8 one too. Picking the wrong collation won't garble your data, though; it will just make string comparison and sorting work incorrectly.
That said, there are several places where you can set the charset in PHP. I would recommend that you use utf-8 throughout, if possible. Places that need a charset specified are:
The database. This can be set on database, table and field level, and even on a per-query level.
Connection between PHP and database.
HTTP output. Make sure that the HTTP header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
HTTP input. Generally forms will be submitted in the same charset as the page was served in, but to make sure, you should specify the accept-charset attribute. Also make sure that URLs are utf-8 encoded, or avoid using non-ASCII characters in URLs (and GET parameters).
utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
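What these two functions do can be mirrored in a few lines. This is a Python sketch of rough equivalents, for illustration only; the `_equiv` names are hypothetical, and behavior on malformed input differs from PHP's.

```python
def utf8_encode_equiv(latin1_bytes: bytes) -> bytes:
    """Rough equivalent of utf8_encode: ISO-8859-1 bytes in, UTF-8 bytes out."""
    return latin1_bytes.decode("latin-1").encode("utf-8")


def utf8_decode_equiv(utf8_bytes: bytes) -> bytes:
    """Rough equivalent of utf8_decode: UTF-8 bytes in, ISO-8859-1 bytes out.

    Characters outside ISO-8859-1 become '?'.
    """
    return utf8_bytes.decode("utf-8").encode("latin-1", errors="replace")


print(utf8_encode_equiv(b"\xe9"))      # b'\xc3\xa9'
print(utf8_decode_equiv(b"\xc3\xa9"))  # b'\xe9'
```

Neither function detects anything: each assumes its input is already in the expected encoding, which is why calling them on data that is in some other encoding makes things worse rather than better.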
There are at least two gotchas with regard to utf-8 and PHP. The first is that PHP's built-in string functions expect strings to be single-byte. For a lot of operations this doesn't matter, but it means that you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually it's not a big problem, but especially when using third-party libraries, you need to be aware that things could blow up on this. One option is to use the mbstring extension, which has the option to replace all troublesome functions with utf-8 aware alternatives. It's still not a 100% bulletproof solution, but it'll work for most cases.
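The difference between counting bytes and counting characters can be shown as follows. This is a Python sketch, used only for illustration: Python's `len` on encoded bytes plays the role of a byte-oriented strlen here, and the sample text is arbitrary.

```python
text = "é日"
encoded = text.encode("utf-8")

print(len(text))     # 2 characters
print(len(encoded))  # 5 bytes, which is what a byte-oriented strlen counts

# Cutting at a byte offset can split a character in half.
truncated = encoded[:1].decode("utf-8", errors="replace")
print(truncated == "\ufffd")  # True: only a replacement character is left
```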
Another problem is that some installations of PHP still have the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.
Things you should do:
Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header()-function to do it manually.
Make sure your database connection is UTF8. SET NAMES utf8 does the trick.
Make sure all your tables are set to UTF8.
Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.
You usually don't have to do too much with the mbstring or utf8_encode/decode functions when you do this.
For better unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following Mysql flags are set correctly -
default-character-set=utf8
skip-character-set-client-handshake # Important, so the client doesn't enforce another encoding
Those can be set in the MySQL configuration file (under the [mysqld] section) or at run time by sending the appropriate queries.
Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:
Identify the encoding
Somehow you need to find out what kind of encoding you're dealing with; otherwise it's pretty pointless to consider it further. You'll end up with junk chars.
Handle your bytes
Think of these strings less like 'strings' of characters and more like lists of bytes.
PHP is especially sneaky. Don't let it truncate your data on the fly. If you're regexing a UTF-8 string, make sure you identify it as such.
Store for the LCD
Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphs? How about Arabic? Which of these is going to require the most space? Account for it.
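To get a feel for the space question, you can measure a few samples. The sketch below is in Python for illustration and assumes UTF-8 as the storage encoding; the sample words are arbitrary greetings.

```python
samples = {
    "English": "hello",
    "Mandarin": "你好",
    "Arabic": "مرحبا",
}

# Number of bytes each sample needs when stored as UTF-8.
sizes = {name: len(text.encode("utf-8")) for name, text in samples.items()}

for name, text in samples.items():
    print(f"{name}: {len(text)} characters, {sizes[name]} bytes in UTF-8")
```

In UTF-8, ASCII letters take one byte each, Arabic letters two, and common Chinese characters three, so a column sized in bytes for English text can be too small for the same number of characters in other scripts.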