PHP/PDO/MySQL: inserting into MEDIUMBLOB stores bad data - php

I have a simple PHP web app that accepts icon images via file upload and stores them in a MEDIUMBLOB column.
On my machine (Windows) plus two Linux servers, this works fine. On a third Linux server, the inserted image is corrupted: unreadable after a SELECT, and the length of the column data as reported by the MySQL length() function is about 40% larger than the size of the uploaded file.
(Each server connects to a separate instance of MySQL.)
Of course, this leads me to think about encoding and character set issues. BLOB columns have no associated charsets, so it seems like the most likely culprit is PDO and its interpretation of the parameter value for that column.
I've tried using bindValue with PDO::PARAM_LOB, to no effect.
I've verified that the images are being received on the server correctly (i.e. am reading them post-upload with no problem), so it's definitely a DB/PDO issue.
I've searched for obvious configuration differences between the servers, but I'm not an expert in PHP configuration so I might have missed something.
The insert code is pretty much as follows:
$imagedata = file_get_contents($_FILES["icon"]["tmp_name"]);
$stmt = $pdo->prepare('insert into foo (theimage) values (:theimage)');
$stmt->bindValue(':theimage', $imagedata, PDO::PARAM_LOB);
$stmt->execute();
Any help will be really appreciated.
UPDATE: The default MySQL charset on the problematic server is utf8; it's latin1 on the others.
The problem is "solved" by adding PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES latin1 COLLATE latin1_general_ci" to the PDO constructor.
This seems like a bug poor design to me: why should the charset of the connection have any effect on data for a binary column, particularly when it's been identified as binary to PDO itself with PARAM_LOB?
Note that the DB tables are defined as latin1 in all cases: it's only the servers' default charsets that are inconsistent.

This seems like a bug to me: why should the charset of the connection have any effect on data for a binary column, particularly when it's been identified as binary to PDO itself with PARAM_LOB?
I do not think that this must be a bug. I can imagine that whenever the client talks with the server and says that the following command is in UTF-8 and the server needs it in Latin-1, then the query might get re-encoded prior parsing and execution. So this is an encoding issue for the transportation of the data. As the whole query prior parsing will get influenced by this re-encoding, the binary data for the BLOB column will get changed as well.
From the Mysql manual:
What character set should the server translate a statement to after receiving it?
For this, the server uses the character_set_connection and collation_connection system variables. It converts statements sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8). collation_connection is important for comparisons of literal strings. For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence.
Or on the way back: Latin1 data from the store will get converted into UTF-8 because the client told the server that it prefers UTF-8 for the transportation.
The identifier for PDO itself you name looks like being something entirely different:
PDO::PARAM_LOB tells PDO to map the data as a stream, so that you can manipulate it using the PHP Streams API. (Ref)
I'm no MySQL expert but I would explain it this way. Client and server need to negotiate which charsets they are using and I assume they do this for a reason.

Related

How does charset affect php when doing MySQL queries?

I noticed that when doing database queries in PHP (e.g. Zend_db, mysqli...), you can set the character set. For example: mysqli_set_charset($con,"utf8"); I'm a little foggy as to what this actually does behind the scenes.
If I use php to do a database SELECT query, and I indicate a character set, what happens if it's not the same character set that the column was defined as in the database?
I mean, the database returns a binary sequence, but what is actually returned if the character is not encoded the same in the two character sets? Will mySQL take the internal binary data and return it "As-is"?
Or will MySQL try to convert it to the binary sequence that's the equivalent in the character set you indicated?
I guess the gist of my question is that I would like to know how the data is encoded when PHP is sending in the query, how it's transmitted back from MySQL, and whether there's another step of translation after PHP gets it back and stores it into a string in PHP internal memory.
Similarly, if you're doing an INSERT or update, how does it get sent from PHP to MySQL? Does PHP convert it to the correct binary encoding THEN send it into MySQL?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Update:
Thanks to Raymond Nijland. I was able to fix my bug. But I did notice that for nonstandard characters, the charset does seem to matter.
I did a select statement using $db = new \PDO("mysql:host=$host;dbname=$database;charset=latin1", $dbuser, $dbpassword);. First, I tried latin1, then I tried utf8.
My problem was that I had a column with encrypted data, which I guess had some wierd characters in it. if I did an md5 on that field directly in the database query, it gave me an encoding that began with 889... BUT, I tried to pulled it into PHP with a SELECT statement. If I used PDO with a charset of latin1, then did an MD5() inside of PHP, it gives me the same hash (889...). Which implies that PHP has an exact copy of the binary that's in the database. BUT if I did used PDO with charset 'UTF-8', then did an MD5() in PHP, it gave me a hash beginning with 087... So somewhere, a conversion must be taking place.
At this point, my orignal bug is fixed, but I'm still curious as to what's happening. Is MYSQL doing the conversion before returning it to PHP, or does PDO do some sort of conversion on the PHP side?
mysqli_set_charset($con,"utf8"); (or other code for other client libraries) declares to MySQL that the encoding in the client will be MySQL's CHARACTER SET utf8. If bytes with a different encoding are sent to (think INSERT) mysql, garbage or errors will occur.
That setting also declares that the client desires that encoding from SELECTs.
The CHARACTER SET on each column in each table may be something else (eg, "latin1"). If so, MySQL will attempt to convert the encoding during the transmission.
Caution: MySQL's CHARACTER SET utf8 is a subset of the well-known UTF-8. To get the latter, use CHARACTER SET utf8mb4 in tables and mysqli_set_charset($con,"utf8mb4"); when connecting.
Going forward, utf8mb4 is preferred in most situations.
Non-text stuff ("as-is") should be put into BLOB or VARBINARY columns -- this bypasses any checking of the encoding. (Think a .jpg or AES_ENCRYPT.)
MySQL's MD5() function returns a hex string. UNHEX(MD5('...')) return binary stuff and must be store in, say, a BINARY(16) column.
Many forms of garbled text are discussed in Trouble with UTF-8 characters; what I see is not what I stored .

PHP PDO Firebird charset never changes

Connection string is like;
firebird:dbname=PRODUCTS.GDB;charset=UTF8
But unicode characters are not correctly returned. I tried changing it to utf-8 with and without dash, small and big letters, to other charsets like ISO8859_9.. All is the same.
The problem is that you are using character set NONE for the columns. For columns with character set NONE all bets are off as Firebird is unable to transliterate to the specified connection character set and will send the data as is. The handling is specific to the client application or driver, some will apply the default system encoding, others will just assume it is in the connection character set they expected (in your case UTF-8), etc. Doing this may even lead to logical data corruption (eg because you are storing it in UTF-8 and another application is retrieving it expecting Windows-1254 or ISO-8859-9).
The fact it may display correct in another application, may be because that application assumes the stored data is in a certain character set and guesses right.
I don't know PHP, nor PDO, but a workaround might be to specify the actual character set of the data (eg WIN1254 instead of UTF8) in the connection string as this may lead to the characters being correctly converted.
However, the only real solution is to create a new database with a default character set other than NONE, execute the DDL (and specifying explicit character sets for columns that need to have a different one), and then pump the data from the old to the new database, making sure you apply the right character set conversion(s).
When this is done you will also need to ensure that all applications connecting to this database will use an explicit connection character set.

When do you need to explicitly call mysql_set_charset?

I am working on a website with MySQL database on a Linux server.
Using phpMyAdmin, on the database, it says
MyISAM is the default storage engine on this MySQL server
latin1_swedish_ci
However, I have created all the tables with InnoDB and utf8_unicode_ci. I also checked that the table fields for all tables is utf8_unicode_ci.
Yet, when I mysql_fetch_array, and echo to stream, it gives gibberish. I had to explicitly set mysql_set_charset('utf8') for the text to appear correctly.
PHP version is 5.3.9; MySQL version is 5.1.70-cll - MySQL Community Server (GPL).
This is the first time I encountered this problem and I never had to set charset before.
What caused the text fetched by php mysql_* to be gibberish? Under what circumstance is it necessary to mysql_set_charset?
EDIT: This is not a question to attract suggestion to use alternative library e.g. mysqli, pdo. I just want to understand about this current situtation about the behavior of MySQL and charsets. Thanks.
When exchanging data between two systems, there's always the question "what encoding will text be sent in?" "Text" is represented simply as binary data, just long strings of 1s and 0s. These could mean anything at all. There are hundreds of encoding schemes to encode different characters into different sequences of 1 and 0. If a system just receives a string of those without being told what encoding they represent, the system cannot know what characters those supposedly are.
Therefore, for any interface between two system, there needs to be a specification for what encoding strings are in. For MySQL, that's the API call mysql_set_charset. This is the way to tell MySQL what encoding strings will be in that PHP sends to it, and what encoding MySQL should returns strings in back to PHP. Without setting this explicitly some default encoding is assumed, which may not be the same encoding you're expecting, creating a mismatch and garbage characters.
Read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App for more information.
It's wise to always call it once connection is established, to ensure your app will not be affected by broken server settings. Because you can have your tables in i.e. UTF8 and send your data in UTF8 but if the connection is not UTF8 (because of i.e my.ini settings) then you end up with mess. So either call mysql_set_charset() or execute SET NAMES charset query, and you will be on safe ground. And since it is done once per connection, it's basically no cost operation anyway
mysql_set_charset functions sets the default character set for the current connection. Even though your data is stored in unicode on the server, it still requires a compatible connection character set to transmit data accurately.
If you execute SHOW VARIABLES LIKE 'character\_set\_%' statement in mysql it will show various sharacter sets used by the server and current connection. Ideally they should all match and be utf8.
More information: MySQL Connection Character Sets.

Php/ODBC encoding problem

I use ODBC to connect to SQL Server from PHP.
In PHP I read some string (nvarchar column) data from SQL Server and then want to insert it to mysql database. When I try to insert such value to mysql database table I get this mysql error:
Incorrect string value: '\xB3\xB9ow...' for column 'name' at row 1
For string with all ASCII characters everything is fine, the problem occurs when non-ASCII characters (from some European languages) exist.
So, in more general terms: there is a Unicode string in MS SQL Server database, which is retrieved by PHP trough ODBC. Then it is put in sql insert query (as value for utf-8 varchar column) which is executed for mysql database.
Can someone explain to me what is happening in this situation in terms of encoding? At which step what character encoding convertions may take place?
I use: PHP 5.2.5, MySQL5.0.45-community-nt, MS Sql Server 2005.
PHP have to run on Linux platform.
UPDATE: The error doesn't occur when I call utf8_encode($s) on this string and use that value in mysql insert query, but then the inserted string doesn't display correctly in mysql database (so that utf8 encoding only worked for enforcing proper utf8 string, but it loses correct characters).
First you have the encoding of the DB. Then you have the encoding used by the ODBC client.
If the encoding of your ODBC client connection does not match the one of the DB, the ODBC layer will automatically transcode your data, in some cases.
The trick here is to force the encoding of the ODBC client connection.
For an "all UTF-8" setup :
$conn=odbc_connect(DB_DSN,DB_USR,DB_PWD);
odbc_exec($conn, "SET NAMES 'UTF8'");
odbc_exec($conn, "SET client_encoding='UTF-8'");
// processing here
This works perfectly with PostgreSQL + Php 5.x.
The exact syntax and options depends on the DB vendor.
You can find very useful and clear additional info for MySql here : http://dev.mysql.com/doc/refman/5.0/fr/charset-connection.html
hope this helps.
Maybe you can use the PDO extension, if it will make any difference?
There is a user contributed comment here that suggests to change the data types in sql server to somethig else, if this is not possible look at the users class that casts fields.
I have no experience with ODBC via PHP, but with the mysql functions PHP seems to default to ASCII and UTF8 connections need to be made explicit if you want to avoid trouble.
Are you sure PHP and the MySQL server are communicating in UTF8? Until PHP 6 the Unicode support tends to be annoyingly inconistent like that.
I remember that the MySQL docs mention a connection string parameter to tweak the Unicode encoding.
From your description it sounds like PHP is treating the connection as ASCII-only.

SET NAMES utf8 in MySQL?

I often see something similar to this below in PHP scripts using MySQL
query("SET NAMES utf8");
I have never had to do this for any project yet so I have a couple basic questions about it.
Is this something that is done with PDO only?
If it is not a PDO specific thing, then what is the purpose of doing it? I realize it is setting the encoding for mysql but I mean, I have never had to use it so why would I want to use it?
It is needed whenever you want to send data to the server having characters that cannot be represented in pure ASCII, like 'ñ' or 'ö'.
That if the MySQL instance is not configured to expect UTF-8 encoding by default from client connections (many are, depending on your location and platform.)
Read http://www.joelonsoftware.com/articles/Unicode.html in case you aren't aware how Unicode works.
Read Whether to use "SET NAMES" to see SET NAMES alternatives and what exactly is it about.
From the manual:
SET NAMES indicates what character set
the client will use to send SQL
statements to the server.
More elaborately, (and once again, gratuitously lifted from the manual):
SET NAMES indicates what character set
the client will use to send SQL
statements to the server. Thus, SET
NAMES 'cp1251' tells the server,
“future incoming messages from this
client are in character set cp1251.”
It also specifies the character set
that the server should use for sending
results back to the client. (For
example, it indicates what character
set to use for column values if you
use a SELECT statement.)
Getting encoding right is really tricky - there are too many layers:
Browser
Page
PHP
MySQL
The SQL command "SET CHARSET utf8" from PHP will ensure that the client side (PHP) will get the data in utf8, no matter how they are stored in the database. Of course, they need to be stored correctly first.
DDL definition vs. real data
Encoding defined for a table/column doesn't really mean that the data are in that encoding. If you happened to have a table defined as utf8 but stored as differtent encoding, then MySQL will treat them as utf8 and you're in trouble. Which means you have to fix this first.
What to check
You need to check in what encoding the data flow at each layer.
Check HTTP headers, headers.
Check what's really sent in body of the request.
Don't forget that MySQL has encoding almost everywhere:
Database
Tables
Columns
Server as a whole
Client
Make sure that there's the right one everywhere.
Conversion
If you receive data in e.g. windows-1250, and want to store in utf-8, then use this SQL before storing:
SET NAMES 'cp1250';
If you have data in DB as windows-1250 and want to retreive utf8, use:
SET CHARSET 'utf8';
Few more notes:
Don't rely on too "smart" tools to show the data. E.g. phpMyAdmin does (was doing when I was using it) encoding really bad. And it goes through all the layers so it's hard to find out.
Also, Internet Explorer had really stupid behavior of "guessing" the encoding based on weird rules.
Use simple editors where you can switch encoding. I recommend MySQL Workbench.
This query should be written before the query which create or update data in the database, this query looks like :
mysql_query("set names 'utf8'");
Note that you should write the encode which you are using in the header for example if you are using utf-8 you add it like this in the header or it will couse a problem with Internet Explorer
so your page looks like this
<html>
<head>
<title>page title</title>
<meta charset="UTF-8" />
</head>
<body>
<?php
mysql_query("set names 'utf8'");
$sql = "INSERT * FROM ..... ";
mysql_query($sql);
?>
</body>
</html>
The solution is
$conn->set_charset("utf8");
Instead of doing this via an SQL query use the php function: mysqli::set_charset
mysqli_set_charset
Note:
This is the preferred way to change the charset. Using mysqli_query() to set it (such as SET NAMES utf8) is not recommended.
See the MySQL character set concepts section for more information.
from http://www.php.net/manual/en/mysqli.set-charset.php
Thanks #all!
don't use: query("SET NAMES utf8"); this is setup stuff and not a query. put it right afte a connection start with setCharset() (or similar method)
some little thing in parctice:
status:
mysql server by default talks latin1
your hole app is in utf8
connection is made without any extra (so: latin1) (no SET NAMES utf8 ..., no set_charset() method/function)
Store and read data is no problem as long mysql can handle the characters.
if you look in the db you will already see there is crap in it (e.g.using phpmyadmin).
until now this is not a problem! (wrong but works often (in europe)) ..
..unless another client/programm or a changed library, which works correct, will read/save data. then you are in big trouble!
Not only PDO. If sql answer like '????' symbols, preset of you charset (hope UTF-8) really recommended:
if (!$mysqli->set_charset("utf8"))
{ printf("Can't set utf8: %s\n", $mysqli->error); }
or via procedure style mysqli_set_charset($db,"utf8")

Categories