How does charset affect php when doing MySQL queries? - php

I noticed that when doing database queries in PHP (e.g. Zend_db, mysqli...), you can set the character set. For example: mysqli_set_charset($con,"utf8"); I'm a little foggy as to what this actually does behind the scenes.
If I use php to do a database SELECT query, and I indicate a character set, what happens if it's not the same character set that the column was defined as in the database?
I mean, the database returns a binary sequence, but what is actually returned if the character is not encoded the same in the two character sets? Will mySQL take the internal binary data and return it "As-is"?
Or will MySQL try to convert it to the binary sequence that's the equivalent in the character set you indicated?
I guess the gist of my question is that I would like to know how the data is encoded when PHP is sending in the query, how it's transmitted back from MySQL, and whether there's another step of translation after PHP gets it back and stores it into a string in PHP internal memory.
Similarly, if you're doing an INSERT or update, how does it get sent from PHP to MySQL? Does PHP convert it to the correct binary encoding THEN send it into MySQL?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Update:
Thanks to Raymond Nijland. I was able to fix my bug. But I did notice that for nonstandard characters, the charset does seem to matter.
I did a select statement using $db = new \PDO("mysql:host=$host;dbname=$database;charset=latin1", $dbuser, $dbpassword);. First, I tried latin1, then I tried utf8.
My problem was that I had a column with encrypted data, which I guess had some wierd characters in it. if I did an md5 on that field directly in the database query, it gave me an encoding that began with 889... BUT, I tried to pulled it into PHP with a SELECT statement. If I used PDO with a charset of latin1, then did an MD5() inside of PHP, it gives me the same hash (889...). Which implies that PHP has an exact copy of the binary that's in the database. BUT if I did used PDO with charset 'UTF-8', then did an MD5() in PHP, it gave me a hash beginning with 087... So somewhere, a conversion must be taking place.
At this point, my orignal bug is fixed, but I'm still curious as to what's happening. Is MYSQL doing the conversion before returning it to PHP, or does PDO do some sort of conversion on the PHP side?

mysqli_set_charset($con,"utf8"); (or other code for other client libraries) declares to MySQL that the encoding in the client will be MySQL's CHARACTER SET utf8. If bytes with a different encoding are sent to (think INSERT) mysql, garbage or errors will occur.
That setting also declares that the client desires that encoding from SELECTs.
The CHARACTER SET on each column in each table may be something else (eg, "latin1"). If so, MySQL will attempt to convert the encoding during the transmission.
Caution: MySQL's CHARACTER SET utf8 is a subset of the well-known UTF-8. To get the latter, use CHARACTER SET utf8mb4 in tables and mysqli_set_charset($con,"utf8mb4"); when connecting.
Going forward, utf8mb4 is preferred in most situations.
Non-text stuff ("as-is") should be put into BLOB or VARBINARY columns -- this bypasses any checking of the encoding. (Think a .jpg or AES_ENCRYPT.)
MySQL's MD5() function returns a hex string. UNHEX(MD5('...')) return binary stuff and must be store in, say, a BINARY(16) column.
Many forms of garbled text are discussed in Trouble with UTF-8 characters; what I see is not what I stored .

Related

How to insert ISO-8859-1 charset values in DB using php

I'm inserting contents into table as bulk upload format (.csv).
In my CSV some columns include special characters but server can't insert special characters included rows into db and stops execution.
How can I insert these values?
I'm trying:
$field1= addslashes(trim($data[0]));
and I'm using mysqli_real_escape_string, it's working but it omits all special characters from the fields...
$field2= mysqli_real_escape_string(trim($data[1]));
Do not call any conversion routines. Simply declare that the .csv file is encoded latin1. Do that with this clause in the LOAD DATA statement:
CHARACTER SET latin1
You have not said what charset the column/table is. Please provide SHOW CREATE TABLE. The column can be either latin1 or utf8.
Also, use mysqli_set_charset to establish which charset your client code is using. (It does not need to be the same as the column/table.)
You say "omits special characters". Do you mean that data is stored into the table(s), but with strings truncated at the special characters? See "Truncated text" in Trouble with utf8 characters; what I see is not what I stored .
You say "stops execution". That seems unlikely; please provide more details, such as the SQL that caused the termination, whether other commands come be run after it, whether you needed to start (not "restart") mysqld again.
You say "special characters"; let's see some examples. You used escaping to get $field variables, but what did you then do with those variables?
MySQL should do the conversion automatically, as long as you tell it what encoding your data is using. With the mysqli extension you use the mysqli::set_charset() function.
If you app is not using ISO-8859-1 and only your current data set does, you need to declare the application encoding and convert data yourself. You have at least three functions:
mb_convert_encoding()
iconv()
utf8_encode()
In either case, you'll get SQL errors if incompatible conversions happen. It shouldn't be the case if databases uses an encoding that contains all ISO-8859-1 characters (such as UTF-8).
Note: get rid of addslashes(), it serves no purpose at all and can only corrupt your data.

When do you need to explicitly call mysql_set_charset?

I am working on a website with MySQL database on a Linux server.
Using phpMyAdmin, on the database, it says
MyISAM is the default storage engine on this MySQL server
latin1_swedish_ci
However, I have created all the tables with InnoDB and utf8_unicode_ci. I also checked that the table fields for all tables is utf8_unicode_ci.
Yet, when I mysql_fetch_array, and echo to stream, it gives gibberish. I had to explicitly set mysql_set_charset('utf8') for the text to appear correctly.
PHP version is 5.3.9; MySQL version is 5.1.70-cll - MySQL Community Server (GPL).
This is the first time I encountered this problem and I never had to set charset before.
What caused the text fetched by php mysql_* to be gibberish? Under what circumstance is it necessary to mysql_set_charset?
EDIT: This is not a question to attract suggestion to use alternative library e.g. mysqli, pdo. I just want to understand about this current situtation about the behavior of MySQL and charsets. Thanks.
When exchanging data between two systems, there's always the question "what encoding will text be sent in?" "Text" is represented simply as binary data, just long strings of 1s and 0s. These could mean anything at all. There are hundreds of encoding schemes to encode different characters into different sequences of 1 and 0. If a system just receives a string of those without being told what encoding they represent, the system cannot know what characters those supposedly are.
Therefore, for any interface between two system, there needs to be a specification for what encoding strings are in. For MySQL, that's the API call mysql_set_charset. This is the way to tell MySQL what encoding strings will be in that PHP sends to it, and what encoding MySQL should returns strings in back to PHP. Without setting this explicitly some default encoding is assumed, which may not be the same encoding you're expecting, creating a mismatch and garbage characters.
Read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App for more information.
It's wise to always call it once connection is established, to ensure your app will not be affected by broken server settings. Because you can have your tables in i.e. UTF8 and send your data in UTF8 but if the connection is not UTF8 (because of i.e my.ini settings) then you end up with mess. So either call mysql_set_charset() or execute SET NAMES charset query, and you will be on safe ground. And since it is done once per connection, it's basically no cost operation anyway
mysql_set_charset functions sets the default character set for the current connection. Even though your data is stored in unicode on the server, it still requires a compatible connection character set to transmit data accurately.
If you execute SHOW VARIABLES LIKE 'character\_set\_%' statement in mysql it will show various sharacter sets used by the server and current connection. Ideally they should all match and be utf8.
More information: MySQL Connection Character Sets.

Is MySQL's collation solely used for sorting?

According to the official MySQL manual the collation used defines the order of records when sorting alphabetically:
http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
However: I have a PHP script (UTF-8) and I save some foreign characters in my MySQL database it's saved all weird (first row). This is when the collation I choose is latin1_swedish_ci. When I change the collation to utf8_unicode_ci all is good (second row).
When saving this data everything is exactly the same except for the collation.
So how about that "collation is used solely for sorting records"?
How someone can clarify this for me :-) Thanks in advance!
It appears that the charset of your connection is not set right, therefore the conversion from the programming language charset to the database is not correct.
You should set the charset in your connection, then both will workfine.
as pointed out in the comments a little explanation on how things work.
when you have not set the character set in your connections, the server assumes it to be the same as the collocation of the database. when data is recieved in a another encoding, the data is written nevertheless. just with wrong or other characters than they have been in the encoding of the data from the script.
as long as nothing changes, the script gets back the same data as it has written and everything appears to be fine.
however when either the connection encoding or the database encoding is changed at this point, the already stored data gets converted to the new encoding. the problem here is that the source data is not in the encoding that is assumend when converting.
all encodings share the ascii set with the same bits, thats why ascii charactes dont mess up. only special charaters do.
so you have to set your conneciton encoding in order to dont produce the mess that you are already in.
now what can you do about the data you already have?
you can make a dump of your database using mysqldump and use the --skip-set-charset option. then you get a plaintext file. in this plane text file replace all occurences of the actual database charset with the one the data is really in (the one you had in your script when you wrote the data).
then save the file and make sure your editor does not do any conversion (i recommend vim).
then import that file and you will get a database with data in the correct encoding. then you can change the encoding however you like and as long as your conneciton charset gets set also you will be fine from now on.
also make sure that the mysql server has the charsets installed, but it should have that already.
this is only my approach, i have cleaned up a lot of messed up installations like that. most of which at some point have garbled characters in their projects (after switching server, updating or restoring a backup...).
turns out not setting the connection charset is something that is very often forgotten.

problems with character encoding pulled from mysql database [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
UTF-8 all the way through
okay, this is stupid that I can't figure it out.
Mysql database is set to utf8_general_ci collation. The field i'm having problems with is longtext type.
characters added to the database as &eacute or other accented characters are returning as �.
I run the output through stripslashes and i've tried both with and without html_entity_decode but can find no change in the output. What am I doing wrong?
Cheers
What character encoding does the string have that you try to insert? If it is in ISO-8859-1 you can use the PHP function utf8_encode() to encode it to UTF-8 before inserting it into the database.
http://php.net/manual/en/function.utf8-encode.php
Getting encoding right is really tricky - there are too many layers:
Browser
Page
PHP
MySQL
The SQL command "SET CHARSET utf8" from PHP will ensure that the client side (PHP) will get the data in utf8, no matter how they are stored in the database. Of course, they need to be stored correctly first.
DDL definition vs. real data
Encoding defined for a table/column doesn't really mean that the data are in that encoding. If you happened to have a table defined as utf8 but stored as differtent encoding, then MySQL will treat them as utf8 and you're in trouble. Which means you have to fix this first.
What to check
You need to check in what encoding the data flow at each layer.
Check HTTP headers, headers.
Check what's really sent in body of the request.
Don't forget that MySQL has encoding almost everywhere:
Database
Tables
Columns
Server as a whole
Client
Make sure that there's the right one everywhere.
Conversion
If you receive data in e.g. windows-1250, and want to store in utf-8, then use this SQL before storing:
SET NAMES 'cp1250';
If you have data in DB as windows-1250 and want to retreive utf8, use:
SET CHARSET 'utf8';
Last note:
Don't rely on too "smart" tools to show the data. E.g. phpMyAdmin does (was doing when I was using it) encoding really bad. And it goes through all the layers so it's hard to find out. Also, Internet Explorer had really stupid behavior of "guessing" the encoding based on weird rules. Use simple editors where you can switch encoding. Also, I recommend MySQL Workbench.

MySQL outputs Western encoding in UTF-8 PHP file

I have the following problem: on a very simple php-mysqli query:
if ( $result = $mysqli->query( $sqlquery ) )
{
$res = $result->fetch_all();
$result->close();
}
I get strings wrongly encoded as Western encoded string, although the database, the table and the column is in utf8_general_ci collation. The php script itself is utf-8 encoded and the mysql-less parts of the script get the correct encodings. So say echo "ő" works perfectly, but echo $res[0] from the previous example outputs the EF BF BD character when the file viewed in the correct UTF-8 encoding. If I manually switch the browser's encoding to Western, the mysqli sourced strings get good decoding, except for the non-western characters being replaced with "?'.
What makes it even stranger is that on my development environment this isn't happening, while on my webserver it is. The developer environment is a LAMP stack (The Uniform Server), while the webserver uses nginx.
In this case, I entered the data in the database using phpMyAdmin, and inside phpmyadmin it displays perfectly. phpMyAdmin's collation is utf-8 too. I believe that the problem must be somewhere around here, as on the same webserver, for an other site where I enter data through php (using POST) the same problem doesn't happen. On that case, the data is visible correctly both while entering and while viewing it (I mean in the php generated webpages), but the special characters are not correct in phpMyAdmin.
Can you help me start where to debug? Is it connected to php or mysql or nginx or phpMyAdmin?
Use mysqli_set_charset to change the client encoding to UTF-8 just after you connect:
$mysqli->set_charset("utf8");
The client encoding is what MySql expects your input to be in (e.g. when you insert user-supplied text to a search query) and what it gives you the results in (so it has to match your output encoding in order for echo to display things correctly).
You need to have it match the encoding of your web page to account for the two scenarios above and the encoding of the PHP source file (so that the hardcoded parts of your queries are interpreted correctly).
Update: How to convert data inserted using latin-1 to utf-8
Regarding data that have already been inserted using the wrong connection encoding there is a convenient solution to fix the problem. For each column that contains this kind of data you need to do:
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET latin1;
ALTER TABLE table_name MODIFY column_name BLOB;
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET utf8;
The placeholders table_name, column_name and existing_column_type should be replaced with the correct values from your database each time.
What this does is
Tell MySql that it needs to store data in that column in latin1. This character set contains only a small subset of utf8 so in general this conversion involves data loss, but in this specific scenario the data was already interpreted as latin1 on input so there will be no side effects. However, MySql will internally convert the byte representation of your data to match what was originally sent from PHP.
Convert the column to a binary type (BLOB) that has no associated encoding information. At this point the column will contain raw bytes that are a proper utf8 character string.
Convert the column to its previous character type, telling MySql that the raw bytes should be considered to be in utf8 encoding.
WARNING: You can only use this indiscriminate approach if the column in question contains only incorrectly inserted data. Any data that has been correctly inserted will be truncated at the first occurrence of any non-ASCII character!
Therefore it's a good idea to do it right now, before the PHP side fix goes into effect.
Use mysqli::set_charset function.
$mysqli->set_charset('utf8'); //returns false if the encoding was not valid... won't happen
http://php.net/manual/en/mysqli.set-charset.php
I haven't used mysqli for some time, but if things are the same, connections by default use the latin swedish encoding (ISO 8859 1).
I will consider your page is already using utf8 encoding by having:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Inside the <head> tag.
If you have string already on latin swedish encoding, you can use mk_convert_encoding:
http://php.net/manual/en/function.mb-convert-encoding.php
$fixedStr = mb_convert_encoding($wrongStr, 'UTF-8', 'ISO-8859-1');
iconv does something very similar: Truth be told, I don't know the difference, but here's the link to the function reference:
http://php.net/manual/en/function.iconv.php
I just realized that you might have some strings in utf8 and others in latin swedish. You can use mb_detect_encoding for that: http://php.net/manual/en/function.mb-detect-encoding.php
You can also dump the database and use iconv (cmd line) if you have it installed:
iconv -f latain -t utf-8 < currentdb.sql > fixeddb.sql

Categories