When do you need to explicitly call mysql_set_charset? - php

I am working on a website with MySQL database on a Linux server.
Using phpMyAdmin, on the database, it says
MyISAM is the default storage engine on this MySQL server
latin1_swedish_ci
However, I have created all the tables with InnoDB and utf8_unicode_ci. I also checked that the table fields for all tables is utf8_unicode_ci.
Yet, when I mysql_fetch_array, and echo to stream, it gives gibberish. I had to explicitly set mysql_set_charset('utf8') for the text to appear correctly.
PHP version is 5.3.9; MySQL version is 5.1.70-cll - MySQL Community Server (GPL).
This is the first time I encountered this problem and I never had to set charset before.
What caused the text fetched by php mysql_* to be gibberish? Under what circumstance is it necessary to mysql_set_charset?
EDIT: This is not a question to attract suggestion to use alternative library e.g. mysqli, pdo. I just want to understand about this current situtation about the behavior of MySQL and charsets. Thanks.

When exchanging data between two systems, there's always the question "what encoding will text be sent in?" "Text" is represented simply as binary data, just long strings of 1s and 0s. These could mean anything at all. There are hundreds of encoding schemes to encode different characters into different sequences of 1 and 0. If a system just receives a string of those without being told what encoding they represent, the system cannot know what characters those supposedly are.
Therefore, for any interface between two system, there needs to be a specification for what encoding strings are in. For MySQL, that's the API call mysql_set_charset. This is the way to tell MySQL what encoding strings will be in that PHP sends to it, and what encoding MySQL should returns strings in back to PHP. Without setting this explicitly some default encoding is assumed, which may not be the same encoding you're expecting, creating a mismatch and garbage characters.
Read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App for more information.

It's wise to always call it once connection is established, to ensure your app will not be affected by broken server settings. Because you can have your tables in i.e. UTF8 and send your data in UTF8 but if the connection is not UTF8 (because of i.e my.ini settings) then you end up with mess. So either call mysql_set_charset() or execute SET NAMES charset query, and you will be on safe ground. And since it is done once per connection, it's basically no cost operation anyway

mysql_set_charset functions sets the default character set for the current connection. Even though your data is stored in unicode on the server, it still requires a compatible connection character set to transmit data accurately.
If you execute SHOW VARIABLES LIKE 'character\_set\_%' statement in mysql it will show various sharacter sets used by the server and current connection. Ideally they should all match and be utf8.
More information: MySQL Connection Character Sets.

Related

How does charset affect php when doing MySQL queries?

I noticed that when doing database queries in PHP (e.g. Zend_db, mysqli...), you can set the character set. For example: mysqli_set_charset($con,"utf8"); I'm a little foggy as to what this actually does behind the scenes.
If I use php to do a database SELECT query, and I indicate a character set, what happens if it's not the same character set that the column was defined as in the database?
I mean, the database returns a binary sequence, but what is actually returned if the character is not encoded the same in the two character sets? Will mySQL take the internal binary data and return it "As-is"?
Or will MySQL try to convert it to the binary sequence that's the equivalent in the character set you indicated?
I guess the gist of my question is that I would like to know how the data is encoded when PHP is sending in the query, how it's transmitted back from MySQL, and whether there's another step of translation after PHP gets it back and stores it into a string in PHP internal memory.
Similarly, if you're doing an INSERT or update, how does it get sent from PHP to MySQL? Does PHP convert it to the correct binary encoding THEN send it into MySQL?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Update:
Thanks to Raymond Nijland. I was able to fix my bug. But I did notice that for nonstandard characters, the charset does seem to matter.
I did a select statement using $db = new \PDO("mysql:host=$host;dbname=$database;charset=latin1", $dbuser, $dbpassword);. First, I tried latin1, then I tried utf8.
My problem was that I had a column with encrypted data, which I guess had some wierd characters in it. if I did an md5 on that field directly in the database query, it gave me an encoding that began with 889... BUT, I tried to pulled it into PHP with a SELECT statement. If I used PDO with a charset of latin1, then did an MD5() inside of PHP, it gives me the same hash (889...). Which implies that PHP has an exact copy of the binary that's in the database. BUT if I did used PDO with charset 'UTF-8', then did an MD5() in PHP, it gave me a hash beginning with 087... So somewhere, a conversion must be taking place.
At this point, my orignal bug is fixed, but I'm still curious as to what's happening. Is MYSQL doing the conversion before returning it to PHP, or does PDO do some sort of conversion on the PHP side?
mysqli_set_charset($con,"utf8"); (or other code for other client libraries) declares to MySQL that the encoding in the client will be MySQL's CHARACTER SET utf8. If bytes with a different encoding are sent to (think INSERT) mysql, garbage or errors will occur.
That setting also declares that the client desires that encoding from SELECTs.
The CHARACTER SET on each column in each table may be something else (eg, "latin1"). If so, MySQL will attempt to convert the encoding during the transmission.
Caution: MySQL's CHARACTER SET utf8 is a subset of the well-known UTF-8. To get the latter, use CHARACTER SET utf8mb4 in tables and mysqli_set_charset($con,"utf8mb4"); when connecting.
Going forward, utf8mb4 is preferred in most situations.
Non-text stuff ("as-is") should be put into BLOB or VARBINARY columns -- this bypasses any checking of the encoding. (Think a .jpg or AES_ENCRYPT.)
MySQL's MD5() function returns a hex string. UNHEX(MD5('...')) return binary stuff and must be store in, say, a BINARY(16) column.
Many forms of garbled text are discussed in Trouble with UTF-8 characters; what I see is not what I stored .

PHP PDO Firebird charset never changes

Connection string is like;
firebird:dbname=PRODUCTS.GDB;charset=UTF8
But unicode characters are not correctly returned. I tried changing it to utf-8 with and without dash, small and big letters, to other charsets like ISO8859_9.. All is the same.
The problem is that you are using character set NONE for the columns. For columns with character set NONE all bets are off as Firebird is unable to transliterate to the specified connection character set and will send the data as is. The handling is specific to the client application or driver, some will apply the default system encoding, others will just assume it is in the connection character set they expected (in your case UTF-8), etc. Doing this may even lead to logical data corruption (eg because you are storing it in UTF-8 and another application is retrieving it expecting Windows-1254 or ISO-8859-9).
The fact it may display correct in another application, may be because that application assumes the stored data is in a certain character set and guesses right.
I don't know PHP, nor PDO, but a workaround might be to specify the actual character set of the data (eg WIN1254 instead of UTF8) in the connection string as this may lead to the characters being correctly converted.
However, the only real solution is to create a new database with a default character set other than NONE, execute the DDL (and specifying explicit character sets for columns that need to have a different one), and then pump the data from the old to the new database, making sure you apply the right character set conversion(s).
When this is done you will also need to ensure that all applications connecting to this database will use an explicit connection character set.

Is MySQL's collation solely used for sorting?

According to the official MySQL manual the collation used defines the order of records when sorting alphabetically:
http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
However: I have a PHP script (UTF-8) and I save some foreign characters in my MySQL database it's saved all weird (first row). This is when the collation I choose is latin1_swedish_ci. When I change the collation to utf8_unicode_ci all is good (second row).
When saving this data everything is exactly the same except for the collation.
So how about that "collation is used solely for sorting records"?
How someone can clarify this for me :-) Thanks in advance!
It appears that the charset of your connection is not set right, therefore the conversion from the programming language charset to the database is not correct.
You should set the charset in your connection, then both will workfine.
as pointed out in the comments a little explanation on how things work.
when you have not set the character set in your connections, the server assumes it to be the same as the collocation of the database. when data is recieved in a another encoding, the data is written nevertheless. just with wrong or other characters than they have been in the encoding of the data from the script.
as long as nothing changes, the script gets back the same data as it has written and everything appears to be fine.
however when either the connection encoding or the database encoding is changed at this point, the already stored data gets converted to the new encoding. the problem here is that the source data is not in the encoding that is assumend when converting.
all encodings share the ascii set with the same bits, thats why ascii charactes dont mess up. only special charaters do.
so you have to set your conneciton encoding in order to dont produce the mess that you are already in.
now what can you do about the data you already have?
you can make a dump of your database using mysqldump and use the --skip-set-charset option. then you get a plaintext file. in this plane text file replace all occurences of the actual database charset with the one the data is really in (the one you had in your script when you wrote the data).
then save the file and make sure your editor does not do any conversion (i recommend vim).
then import that file and you will get a database with data in the correct encoding. then you can change the encoding however you like and as long as your conneciton charset gets set also you will be fine from now on.
also make sure that the mysql server has the charsets installed, but it should have that already.
this is only my approach, i have cleaned up a lot of messed up installations like that. most of which at some point have garbled characters in their projects (after switching server, updating or restoring a backup...).
turns out not setting the connection charset is something that is very often forgotten.

PHP/PDO/MySQL: inserting into MEDIUMBLOB stores bad data

I have a simple PHP web app that accepts icon images via file upload and stores them in a MEDIUMBLOB column.
On my machine (Windows) plus two Linux servers, this works fine. On a third Linux server, the inserted image is corrupted: unreadable after a SELECT, and the length of the column data as reported by the MySQL length() function is about 40% larger than the size of the uploaded file.
(Each server connects to a separate instance of MySQL.)
Of course, this leads me to think about encoding and character set issues. BLOB columns have no associated charsets, so it seems like the most likely culprit is PDO and its interpretation of the parameter value for that column.
I've tried using bindValue with PDO::PARAM_LOB, to no effect.
I've verified that the images are being received on the server correctly (i.e. am reading them post-upload with no problem), so it's definitely a DB/PDO issue.
I've searched for obvious configuration differences between the servers, but I'm not an expert in PHP configuration so I might have missed something.
The insert code is pretty much as follows:
$imagedata = file_get_contents($_FILES["icon"]["tmp_name"]);
$stmt = $pdo->prepare('insert into foo (theimage) values (:theimage)');
$stmt->bindValue(':theimage', $imagedata, PDO::PARAM_LOB);
$stmt->execute();
Any help will be really appreciated.
UPDATE: The default MySQL charset on the problematic server is utf8; it's latin1 on the others.
The problem is "solved" by adding PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES latin1 COLLATE latin1_general_ci" to the PDO constructor.
This seems like a bug poor design to me: why should the charset of the connection have any effect on data for a binary column, particularly when it's been identified as binary to PDO itself with PARAM_LOB?
Note that the DB tables are defined as latin1 in all cases: it's only the servers' default charsets that are inconsistent.
This seems like a bug to me: why should the charset of the connection have any effect on data for a binary column, particularly when it's been identified as binary to PDO itself with PARAM_LOB?
I do not think that this must be a bug. I can imagine that whenever the client talks with the server and says that the following command is in UTF-8 and the server needs it in Latin-1, then the query might get re-encoded prior parsing and execution. So this is an encoding issue for the transportation of the data. As the whole query prior parsing will get influenced by this re-encoding, the binary data for the BLOB column will get changed as well.
From the Mysql manual:
What character set should the server translate a statement to after receiving it?
For this, the server uses the character_set_connection and collation_connection system variables. It converts statements sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8). collation_connection is important for comparisons of literal strings. For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence.
Or on the way back: Latin1 data from the store will get converted into UTF-8 because the client told the server that it prefers UTF-8 for the transportation.
The identifier for PDO itself you name looks like being something entirely different:
PDO::PARAM_LOB tells PDO to map the data as a stream, so that you can manipulate it using the PHP Streams API. (Ref)
I'm no MySQL expert but I would explain it this way. Client and server need to negotiate which charsets they are using and I assume they do this for a reason.

How do I ensure that user entered data containing international characters doesn't get corrupted?

It often happens that characters such as é gets transformed to é, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF8.
I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.
I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.
How do I ensure that user entered data containing international characters doesn't get corrupted?
On the first look at http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet I think that one important thing is missing (perhaps I overlooked this one).
Depending on your MySQL installation and/or configuration you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be you PHP script). You can do this by manually issuing a
SET NAMES utf8
query prior to any other query you send to the MySQL server.
If your're using PDO on the PHP side you can set-up the connection to automatically issue this query on every (re)connect by using
$db=new PDO($dsn, $user, $pass);
$db->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES utf8");
when initializing your db connection.
Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, so should the collation. Picking the wrong collation won't garble your data though - Just make string-comparison/sorting work wrongly.
That said, there are several places, where you can set charset settings in PHP. I would recommend that you use utf-8 throughout, if possible. Places that needs charset specified are:
The database. This can be set on database, table and field level, and even on a per-query level.
Connection between PHP and database.
HTTP output; Make sure that the HTTP-header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
HTTP input. Generally forms will be submitteed in the same charset as the page was served up in, but to make sure, you should specify the accept-charset property. Also make sure that URL's are utf-8 encoded, or avoid using non-ascii characters in url's (And GET parameters).
utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
There are at least two gotchas in regards to utf-8 and PHP. The first is that PHP's builtin string functions expect strings to be single-byte. For a lot of operations, this doesn't matter, but it means than you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually, it's not a big problem, but especially when using 3-party libraries, you need to be aware that things could blow up on this. One option is also to use the mb_string extension, which has the option to replace all troublesome functions with utf-8 aware alternatives. It's still not a 100% bulletproof solution, but it'll work for most cases.
Another problem is that some installations of PHP still has the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.
Things you should do:
Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header()-function to do it manually.
Make sure your database connection is UTF8. SET NAMES utf8 does the trick.
Make sure all your tables are set to UTF8.
Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.
You usually don't have to do to much using the mb_string or utf8_encode/decode-functions when you do this.
For better unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following Mysql flags are set correctly -
default-character-set=utf8
skip-character-set-client-handshake //Important so the client doesn't enforce another encoding
Those can be set in the mysql configuration file (under the [mysqld] tab) or at run time by sending the appropriate queries.
Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:
Identify the encoding
somehow you want to find out what kind of encoding you're dealing with, otherwise, it's pretty pointless to consider it further. You'll end up with junk chars.
Handle your bytes
think of these strings less like 'strings' of characters, and more like lists of bytes
PHP is especially sneaky. Don't let it truncate your data on-the-fly. If you're regexing a UTF-8 string, make sure you identify it as such
Store for the LCD
Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphps? How about Arabic? Which of these is going to require the most space? Account for it.

Categories