Strange Encoding Problem - php

I have a table of data encoded in the latin5 charset, and all the columns in the table are also latin5. From the mysql console, when I enter "SET NAMES 'latin5'" and query the table, the results are OK, and when I delete or insert/update, the encoding of all the new data is perfect. But when I try to insert ISO-8859 data (I verified the encoding with mb_detect_encoding) without "SET NAMES", inserts, updates and selects all come out in the wrong encoding. With "SET NAMES 'latin5'", inserts and updates are still wrong, but selects are OK: the latin5 data comes back properly encoded. With "SET NAMES 'utf8'", the select queries are badly encoded but inserts and updates are OK.
The reason I ask is that we are about to go to production, and this makes me worry about possible future problems.

mb_detect_encoding doesn't know what encoding your string is. It makes a qualified guess, but there are no guarantees that it will guess right. Especially not if the candidates are all single-byte encodings, as in the case of latin1 and latin5.
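To see why detection can only guess between single-byte encodings, here is a small sketch (the variable names are illustrative, and it assumes the mbstring extension is loaded). The same byte is a valid character in both latin1 (ISO-8859-1) and latin5 (ISO-8859-9), so nothing in the data itself says which one was meant.

```php
<?php
// One byte, 0xFD, taken as an example of ambiguous single-byte data.
$byte = "\xFD";

// Decode the same byte under two different assumptions.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'); // "ý"
$asLatin5 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-9'); // "ı" (dotless i)

// Both decodings succeed; only outside knowledge tells you which is right.
echo bin2hex($asLatin1), "\n"; // c3bd
echo bin2hex($asLatin5), "\n"; // c4b1
```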
There really is no substitute for knowing what you're doing, if you want to get charsets right. I suggest that you read these pages at least a couple of times:
http://www.phpwact.org/php/i18n/charsets
http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet
In particular, note that a web page is served with an HTTP header that specifies the charset the page is encoded in. Unless you explicitly set this from your PHP script, you'll use the web server's default, which may vary from server to server.
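As a minimal sketch of setting the charset explicitly from a script (assuming an HTML page served as UTF-8; the variable name is illustrative), the header has to be sent before any output:

```php
<?php
// Build the Content-Type value once so the charset is stated in one place.
$contentType = 'text/html; charset=utf-8';

// header() must run before any output is sent; headers_sent() guards that.
if (!headers_sent()) {
    header('Content-Type: ' . $contentType);
}
```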
Also, take care to actually understand what is going on, rather than doing trial and error. The latter can easily get you something that works in some contexts, but not in every context.
And lastly, if you have any choice at all, I seriously suggest that you use utf-8 for everything. latin5 is going to get you lots of grief.

Related

PHP PDO Firebird charset never changes

The connection string looks like this:
firebird:dbname=PRODUCTS.GDB;charset=UTF8
But Unicode characters are not returned correctly. I tried changing it to utf-8, with and without the dash, in lower and upper case, and to other charsets like ISO8859_9. The result is always the same.
The problem is that you are using character set NONE for the columns. For columns with character set NONE all bets are off as Firebird is unable to transliterate to the specified connection character set and will send the data as is. The handling is specific to the client application or driver, some will apply the default system encoding, others will just assume it is in the connection character set they expected (in your case UTF-8), etc. Doing this may even lead to logical data corruption (eg because you are storing it in UTF-8 and another application is retrieving it expecting Windows-1254 or ISO-8859-9).
The fact that it may display correctly in another application may be because that application assumes the stored data is in a certain character set and guesses right.
I don't know PHP, nor PDO, but a workaround might be to specify the actual character set of the data (e.g. WIN1254 instead of UTF8) in the connection string, as this may lead to the characters being correctly converted.
However, the only real solution is to create a new database with a default character set other than NONE, execute the DDL (specifying explicit character sets for columns that need a different one), and then pump the data from the old to the new database, making sure you apply the right character set conversion(s).
When this is done you will also need to ensure that all applications connecting to this database will use an explicit connection character set.

Is MySQL's collation solely used for sorting?

According to the official MySQL manual the collation used defines the order of records when sorting alphabetically:
http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
However: I have a PHP script (UTF-8), and when I save some foreign characters in my MySQL database they are saved all weird (first row). This is when the collation I choose is latin1_swedish_ci. When I change the collation to utf8_unicode_ci all is good (second row).
When saving this data everything is exactly the same except for the collation.
So how about that "collation is used solely for sorting records"?
Hope someone can clarify this for me :-) Thanks in advance!
It appears that the charset of your connection is not set correctly, so the conversion from the programming language's charset to the database's is not correct.
You should set the charset in your connection; then both collations will work fine.
As pointed out in the comments, here is a little explanation of how things work.
When you have not set the character set in your connection, the server assumes a default, often the same as the character set of the database. When data is received in another encoding, it is written nevertheless, just with wrong characters, different from the ones in the script's encoding.
As long as nothing changes, the script gets back the same data it has written and everything appears to be fine.
However, when either the connection encoding or the database encoding is changed at this point, the already stored data gets converted to the new encoding. The problem here is that the source data is not in the encoding that is assumed when converting.
The encodings involved all share the ASCII set with the same bits; that's why ASCII characters don't get messed up. Only special characters do.
So you have to set your connection encoding in order not to produce the mess that you are already in.
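The mix-up described above can be reproduced without a database. This is only a sketch (it assumes the mbstring extension, and uses ISO-8859-1 as a stand-in for whatever wrong charset the connection assumed): UTF-8 bytes read as latin1 and converted again turn one character into two, while ASCII passes through untouched.

```php
<?php
// "é" written as explicit UTF-8 bytes, so the source file's encoding doesn't matter.
$utf8 = "\xC3\xA9";

// Wrong assumption: treat those bytes as ISO-8859-1 and convert to UTF-8.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n"; // c383c2a9, displayed as "Ã©"

// Reversing the same wrong step restores the original bytes.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n"; // c3a9

// Plain ASCII is identical in both encodings, so it is never affected.
$ascii = mb_convert_encoding('abc', 'UTF-8', 'ISO-8859-1');
```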
Now what can you do about the data you already have?
You can make a dump of your database using mysqldump with the --skip-set-charset option. Then you get a plain-text file. In this plain-text file, replace all occurrences of the actual database charset with the one the data is really in (the one you had in your script when you wrote the data).
Then save the file and make sure your editor does not do any conversion (I recommend vim).
Then import that file and you will get a database with data in the correct encoding. After that you can change the encoding however you like, and as long as your connection charset is also set you will be fine from now on.
Also make sure that the MySQL server has the charsets installed, but it should have them already.
This is only my approach; I have cleaned up a lot of messed-up installations like that, most of which at some point had garbled characters in their projects (after switching servers, updating or restoring a backup...).
It turns out that setting the connection charset is something that is very often forgotten.

Help with multi-lingual text, php, and mysql

I have had no end of problems trying to do what I thought would be relatively simple:
I need to have a form which can accept user input text in a mix of English and other languages, some multi-byte (e.g. Japanese, Korean, etc.), and this gets processed by PHP and is stored (safely, avoiding SQL injection) in a MySQL database. It also needs to be read back from the database, processed, and used on-screen.
I have it set up fine for Latin chars, but when I add a mix of Latin and multi-byte chars it turns garbled.
I have tried to do my homework but just am banging my head against a wall now.
Magic quotes is off, I have tried using utf8_encode/decode, htmlentities, addslashes/stripslashes, and (in mysql) both "utf8_general_ci" and "utf8_unicode_ci" for the field in the table.
Part of the problem is that there are so many places where I could be messing it up that I'm not sure where to begin solving the problem.
Thanks very much for any and all help with this. Ideally, if someone has working php code examples and/or knows the right mysql table format, that would be fantastic.
Here is a laundry list of things to check are in UTF8 mode:
MySQL table encoding. You seem to have already done this.
MySQL connection encoding. Do SHOW VARIABLES LIKE 'char%' and you will see what MySQL is using. You need character_set_client, character_set_connection and character_set_results set to utf8, which can easily be set in your application by doing SET NAMES 'utf8' at the start of all connections. This is the one most people forget to check, IME.
If you use them, your CLI and terminal settings. In bash, this means LANG=(something).UTF-8.
Your source code (this is not usually a problem unless you have UTF8 constant text).
The page encoding. You seem to have this one right, too, but your browser's debug tools can help a lot.
Once you get all this right, all you will need in your app is mysql_real_escape_string().
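A minimal sketch of the connection step from the checklist, assuming the mysqli extension (the mysql_* functions, including mysql_real_escape_string(), were removed in PHP 7). The helper names, the table name "messages" and the column name "body" are made up for illustration, and a prepared statement stands in for manual escaping.

```php
<?php
// Hypothetical helper: opens a connection and sets its charset to UTF-8.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $conn = new mysqli($host, $user, $pass, $db);
    if ($conn->connect_error) {
        throw new RuntimeException('Connect failed: ' . $conn->connect_error);
    }
    // Sets the connection charset on the server and in the client library.
    if (!$conn->set_charset('utf8')) {
        throw new RuntimeException('Cannot set charset: ' . $conn->error);
    }
    return $conn;
}

// Hypothetical helper: stores one string using a prepared statement,
// so no manual escaping is needed. "messages" and "body" are made-up names.
function store_message(mysqli $conn, string $text): bool
{
    $stmt = $conn->prepare('INSERT INTO messages (body) VALUES (?)');
    if ($stmt === false) {
        return false;
    }
    $stmt->bind_param('s', $text);
    return $stmt->execute();
}
```

On MySQL 5.5.3 or later, 'utf8mb4' is the charset to pass instead if characters outside the Basic Multilingual Plane have to be stored.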
Oh, and it is (sadly) possible to successfully store correctly encoded UTF8 text in a column with the wrong encoding type or from a connection with the wrong encoding type. And it can come back "correctly", too. Until you fix all the bits that aren't UTF8, at which point it breaks.
I don't think you have any practical alternatives to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the data base from the mysql command line, or perhaps through phpmyadmin. Track down and eliminate problems at that level. Then move out one more level by simulating input to your php and examining the output, again dealing with any problems. Finally add browsers into the mix.
First you need to check whether you can add multi-language text to your database directly. If that's possible, you can then do it in your application.
Are you serializing any data by chance? PHP's serialize function has some issues when serializing non-English characters.
Everything you do should be utf-8 encoded.
One thing you could try is to json_encode() the data when putting it into the database and json_decode() it when it's retrieved.
The problem was caused by my not having the default char set in the php.ini file, and (possibly) not having set the char set in the mysql table (in PhpMyAdmin, via the Operations tab).
Setting the default char set to "utf-8" fixed it. Thanks for the help!!
Check your database connection settings. It also needs to support UTF-8.

SET NAMES utf8 in MySQL?

I often see something similar to this below in PHP scripts using MySQL
query("SET NAMES utf8");
I have never had to do this for any project yet so I have a couple basic questions about it.
Is this something that is done with PDO only?
If it is not a PDO specific thing, then what is the purpose of doing it? I realize it is setting the encoding for mysql but I mean, I have never had to use it so why would I want to use it?
It is needed whenever you want to send data to the server that contains characters which cannot be represented in pure ASCII, like 'ñ' or 'ö'.
That is, if the MySQL instance is not configured to expect UTF-8 encoding by default from client connections (many are, depending on your location and platform).
Read http://www.joelonsoftware.com/articles/Unicode.html in case you aren't aware how Unicode works.
Read Whether to use "SET NAMES" to see the alternatives to SET NAMES and what exactly it is about.
From the manual:
SET NAMES indicates what character set the client will use to send SQL statements to the server.
More elaborately, (and once again, gratuitously lifted from the manual):
SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server, “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)
Getting encoding right is really tricky - there are too many layers:
Browser
Page
PHP
MySQL
The SQL command "SET CHARSET utf8" from PHP will ensure that the client side (PHP) will get the data in utf8, no matter how they are stored in the database. Of course, they need to be stored correctly first.
DDL definition vs. real data
The encoding defined for a table/column doesn't really mean that the data is in that encoding. If you happen to have a table defined as utf8 but the data stored in a different encoding, then MySQL will treat it as utf8 and you're in trouble. Which means you have to fix this first.
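One way to check what is really there is to test the bytes your script receives rather than trust the column definition. A small sketch, assuming the mbstring extension; the sample values are made up:

```php
<?php
// Two byte strings that might come back from a column declared as utf8.
$reallyUtf8 = "\xC3\xA9"; // "é" encoded as UTF-8
$reallyLatin1 = "\xE9";   // "é" encoded as ISO-8859-1

// mb_check_encoding() reports whether the bytes are valid in the given encoding.
var_dump(mb_check_encoding($reallyUtf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($reallyLatin1, 'UTF-8')); // bool(false)
```

A true result only says the bytes are valid UTF-8, not that they are the intended text: double-encoded data is valid UTF-8 too.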
What to check
You need to check in what encoding the data flow at each layer.
Check the HTTP headers.
Check what's really sent in body of the request.
Don't forget that MySQL has encoding almost everywhere:
Database
Tables
Columns
Server as a whole
Client
Make sure that there's the right one everywhere.
Conversion
If you receive data in e.g. windows-1250, and want to store in utf-8, then use this SQL before storing:
SET NAMES 'cp1250';
If you have data in the DB as windows-1250 and want to retrieve utf8, use:
SET CHARSET 'utf8';
Few more notes:
Don't rely on too "smart" tools to show the data. E.g. phpMyAdmin does (was doing when I was using it) encoding really bad. And it goes through all the layers so it's hard to find out.
Also, Internet Explorer had really stupid behavior of "guessing" the encoding based on weird rules.
Use simple editors where you can switch encoding. I recommend MySQL Workbench.
This query should be issued before the query which creates or updates data in the database. It looks like this:
mysql_query("set names 'utf8'");
Note that you should declare the encoding you are using in the page head. For example, if you are using utf-8, add it like this in the head, or it will cause a problem with Internet Explorer
so your page looks like this
<html>
<head>
<title>page title</title>
<meta charset="UTF-8" />
</head>
<body>
<?php
mysql_query("set names 'utf8'");
$sql = "INSERT INTO ..... ";
mysql_query($sql);
?>
</body>
</html>
The solution is
$conn->set_charset("utf8");
Instead of doing this via an SQL query use the php function: mysqli::set_charset
mysqli_set_charset
Note:
This is the preferred way to change the charset. Using mysqli_query() to set it (such as SET NAMES utf8) is not recommended.
See the MySQL character set concepts section for more information.
from http://www.php.net/manual/en/mysqli.set-charset.php
Thanks #all!
Don't use query("SET NAMES utf8"); this is setup stuff and not a query. Put it right after the connection is opened, with setCharset() (or a similar method).
A little thing from practice:
Status:
The MySQL server talks latin1 by default.
Your whole app is in utf8.
The connection is made without any extras (so: latin1), i.e. no SET NAMES utf8 and no set_charset() method/function.
Storing and reading data is no problem as long as MySQL can handle the characters.
If you look in the db (e.g. using phpMyAdmin) you will already see that there is garbage in it.
Until now this is not a problem! (Wrong, but it often works, at least in Europe.)
Unless another client/program, or a changed library that works correctly, reads or saves data. Then you are in big trouble!
Not only PDO. If SQL answers come back as '????' symbols, presetting your charset (hopefully UTF-8) is really recommended:
if (!$mysqli->set_charset("utf8"))
{ printf("Can't set utf8: %s\n", $mysqli->error); }
or in procedural style: mysqli_set_charset($db, "utf8")

How do I ensure that user entered data containing international characters doesn't get corrupted?

It often happens that characters such as é get transformed to Ã©, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF8.
I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.
I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.
How do I ensure that user entered data containing international characters doesn't get corrupted?
On the first look at http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet I think that one important thing is missing (perhaps I overlooked this one).
Depending on your MySQL installation and/or configuration you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a
SET NAMES utf8
query prior to any other query you send to the MySQL server.
If you're using PDO on the PHP side you can set up the connection to automatically issue this query on every (re)connect by using
$db = new PDO($dsn, $user, $pass, array(
    // The init command has to be passed when the connection is created.
    PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8",
));
when initializing your db connection.
Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, so should the collation. Picking the wrong collation won't garble your data though - Just make string-comparison/sorting work wrongly.
That said, there are several places, where you can set charset settings in PHP. I would recommend that you use utf-8 throughout, if possible. Places that needs charset specified are:
The database. This can be set on database, table and field level, and even on a per-query level.
Connection between PHP and database.
HTTP output; Make sure that the HTTP-header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
HTTP input. Generally forms will be submitted in the same charset as the page was served in, but to make sure, you should specify the accept-charset attribute. Also make sure that URLs are utf-8 encoded, or avoid using non-ASCII characters in URLs (and GET parameters).
utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
There are at least two gotchas in regards to utf-8 and PHP. The first is that PHP's builtin string functions expect strings to be single-byte. For a lot of operations, this doesn't matter, but it means that you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually, it's not a big problem, but especially when using third-party libraries, you need to be aware that things could blow up on this. One option is also to use the mbstring extension, which has the option to replace all troublesome functions with utf-8 aware alternatives. It's still not a 100% bulletproof solution, but it'll work for most cases.
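A short sketch of the first gotcha, assuming the mbstring extension is loaded (the sample string is made up): the byte-oriented functions and their mb_ counterparts disagree as soon as a multi-byte character is involved.

```php
<?php
// "héllo" with the é given as explicit UTF-8 bytes.
$word = "h\xC3\xA9llo";

echo strlen($word), "\n";             // 6: counts bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // 5: counts characters

// substr() can cut a multi-byte character in half; mb_substr() does not.
$cut = substr($word, 0, 2);              // "h" plus half of "é" (invalid UTF-8)
$safe = mb_substr($word, 0, 2, 'UTF-8'); // "hé"
```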
Another problem is that some installations of PHP still have the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.
Things you should do:
Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header()-function to do it manually.
Make sure your database connection is UTF8. SET NAMES utf8 does the trick.
Make sure all your tables are set to UTF8.
Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.
You usually don't have to do too much with the mbstring or utf8_encode/decode functions when you do this.
For better Unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following MySQL flags are set correctly -
default-character-set=utf8
skip-character-set-client-handshake //Important so the client doesn't enforce another encoding
Those can be set in the MySQL configuration file (under the [mysqld] section) or at run time by sending the appropriate queries.
Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:
Identify the encoding
somehow you want to find out what kind of encoding you're dealing with, otherwise, it's pretty pointless to consider it further. You'll end up with junk chars.
Handle your bytes
think of these strings less like 'strings' of characters, and more like lists of bytes
PHP is especially sneaky. Don't let it truncate your data on-the-fly. If you're regexing a UTF-8 string, make sure you identify it as such
Store for the LCD
Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphs? How about Arabic? Which of these is going to require the most space? Account for it.
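Two of the points above can be sketched in PHP (sample strings are made up): marking a regex as UTF-8 with the u modifier, and counting bytes rather than characters when sizing storage.

```php
<?php
// A single "é" as explicit UTF-8 bytes (two bytes, one character).
$char = "\xC3\xA9";

// Without the u modifier, "." matches one byte, so this fails.
$byteMode = preg_match('/^.$/', $char);
// With the u modifier, pattern and subject are treated as UTF-8.
$utf8Mode = preg_match('/^.$/u', $char);
printf("%d %d\n", $byteMode, $utf8Mode); // 0 1

// Storage size depends on the script: count bytes, not characters.
$english = 'hello';                     // 5 characters
$mandarin = "\xE4\xBD\xA0\xE5\xA5\xBD"; // "你好", 2 characters
printf("%d %d\n", strlen($english), strlen($mandarin)); // 5 6
```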
