ext/mysql charset support vs ext/mysqli charset

ext/mysql charset support vs ext/mysqli charset - php

I read some articles that promoted the use of the new ext/mysqli in php due to it's support of character sets. I currently use ext/mysql and use SET NAMES UTF-8 to ensure all my data is stored as utf-8. isn't that charset support in ext/mysql or am I missing something larger?
Thanks :)

SET NAMES UTF-8 does no mean the data are stored in UTF-8. That means that data is RECIEVED in UTF-8 from client and is SERVED in UTF-8 to client.
Storage encoding is set when you create a db/table/row, for example
CREATE TABLE{
...
}CHARSET=utf8;
or
CREATE DATABASE DEFAULT CHARACTER SET utf8
Read here: Mysql: latin1-> utf8. Convert characters to their multibyte equivalents
2 Lyon
mysql goes just fine.
Please check once more the encodings of the tables and rows via, for example, phpMyAdmin. Remember that setting encoding to database doesn't automatically change the encoding of tables. It's just used for a default value if table encoding is not specified

Related

Encoding troubles converting MySQL to Mongo with PHP

I've been having a lot of encoding troubles with PHP/Mongo in general.
Right now, I'm in the process of converting some data from MySQL to Mongo. I have a string that contains a é, but when I try to encode it to UFT-8 (via mb_convert_encoding, uft8_encode), it turns into Ã©. I'm sure other strings also contain other accented characters.
I've tried mb_detect_encoding, which told me the string is UTF-8, but when I do mb_check_encoding($string, 'UTF-8'), it returns false.
Basically, I have no idea what's wrong. This is on a page that is just a PHP script, no HTML. Any advice to this problem, or in general maintaining character encoding when inserting into Mongo?
Here is the script in question: https://plnkr.co/edit/eAkLxfklzLNCsZTBPKsX
The MySQL table is using a MyISAM engine, charset utf8, collation utf8_unicode_ci

Do not use the mysql_* API; change to mysqli_*
Do not use any mb or utf8 encode/decode routines; they merely hide the 'proper' solution.
Right after connecting to mysql, do SET NAMES utf8.
SHOW CREATE TABLE -- verify that the table/columns are CHARACTER SET utf8 (or utf8mb4)
Ã© is the Mojibake for é. It usually indicates a mismatch of latin1 settings and utf8 settings.
If using PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF8', $user, $pwd); or execute SET NAMES utf8.

The ultimate emoji encoding scheme

This is my environment: Client -> iOS App, Server ->PHP and MySQL.
The data from client to server is done via HTTP POST.
The data from server to client is done with json.
I would like to add support for emojis or any utf8mb4 character in general. I'm looking for the right way for dealing with this under my scenario.
My questions are the following:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols? If so, which encoding method should I use so that it works flawlessly in Objective-C and PHP (and java for the future android version)?
Right now I have the DB with utf8mb4 but I get errors when trying to store a raw emoji. On the other hand, I can store non-utf8 symbols such ¿ or á.
When I retrieve this symbols in PHP I first need to execute SET CHARACTER SET utf8 (if I get them in utf8mb4 the json_decode function doesn't work), then such symbols are encoded (e.g., ¿ is encoded to \u00bf).

MySQL's utf8 charset is not actually UTF-8, it's a subset of UTF-8 only supporting the basic plane (characters up to U+FFFF). Most emoji use code points higher than U+FFFF. MySQL's utf8mb4 is actual UTF-8 which can encode all those code points. Outside of MySQL there's no such thing as "utf8mb4", there's just UTF-8. So:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
Again, no such thing as "utf8mb4". HTTP POST requests support any raw bytes, if your client sends UTF-8 encoded data you're fine.
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Yes.
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?
God no, use raw UTF-8 (utf8mb4) for all that is holy.
When I retrieve this symbols in PHP I first need to execute SET CHARACTER SET utf8
Well, there's your problem; channeling your data through MySQL's utf8 charset will discard any characters above U+FFFF. Use utf8mb4 all the way through MySQL.
if I get them in utf8mb4 the json_decode function doesn't work
You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:
echo json_encode('😀');
"\ud83d\ude00"
echo json_decode('"\ud83d\ude00"');
😀

Use utf8mb4 throughout MySQL:
SET NAMES utf8mb4
Declare the table/columns CHARACTER SET utf8mb4
Emoji and certain Chinese characters will work in utf8mb4, but not in MySQL's utf8.
Use UTF-8 throughout other things:
HTML:
¿ or á are (or at least can be) encoded in utf8 (utf8mb4)

Using utf8mb4 with php and mysql

I have read that mysql >= 5.5.3 fully supports every possible character if you USE the encoding utf8mb4 for a certain table/column http://mathiasbynens.be/notes/mysql-utf8mb4
looks nice. Only I noticed that the mb_functions in php does not! I cannot find it anywhere in the list: http://php.net/manual/en/mbstring.supported-encodings.php
Not only have I read things but I also made a test.
I have added data to a mysql utf8mb4 table using a php script where the internal encoding was set to UTF-8: mb_internal_encoding("UTF-8");
and, as expected, the characters looks messy once in the db.
Any idea how I can make php and mysql talk the same encoding (possibly a 4 bytes one) and still have FULL support to any world language?
Also why is utf8mb4 different from utf32?

MySQL's utf8 encoding is not actual UTF-8. It's an encoding that is kinda like UTF-8, but only supports a subset of what UTF-8 supports. utf8mb4 is actual UTF-8. This difference is an internal implementation detail of MySQL. Both look like UTF-8 on the PHP side. Whether you use utf8 or utf8mb4, PHP will get valid UTF-8 in both cases.
What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4. If it's set to utf8, MySQL will not support all characters. You set this connection encoding using mysql_set_charset(), the PDO charset DSN connection parameter or whatever other method is appropriate for your database API of choice.
mb_internal_encoding just sets the default value for the $encoding parameter all mb_* functions have. It has nothing to do with MySQL.
UTF-8 and UTF-32 differ in how they encode characters. UTF-8 uses a minimum of 1 byte for a character and a maximum of 4. UTF-32 always uses 4 bytes for every character. UTF-16 uses a minimum of 2 bytes and a maximum of 4.
Due to its variable length, UTF-8 has a little bit of overhead. A character which can be encoded in 2 bytes in UTF-16 may take 3 or 4 in UTF-8; on the other hand, UTF-16 never uses less than 2 bytes. If you're storing lots of Asian text, UTF-16 may use less storage. If most of your text is English/ASCII, UTF-8 uses less storage. UTF-32 always uses the most storage.

This is what i used, and worked good for my problem using euro € sign and conversion for json_encode failure.
php configurations script( api etc..)
header('Content-Type: text/html; charset=utf-8');
ini_set("default_charset", "UTF-8");
mb_internal_encoding("UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
mysql tables / or specific columns
utf8mb4
mysql PDO connection
$dsn = 'mysql:host=yourip;dbname=XYZ;charset=utf8mb4';
(...your connection ...)
before execute query (might not be required):
$dbh->exec("set names utf8mb4");

utf-32: This is a character encoding using a fixed 4-bytes per characters
utf-8: This is a character encoding using up to 4 bytes per characters, but the most frequent characters are coded on only 1, 2 or 3 characters.
MySQL's utf-8 doesn't support characters coded on more than 3 characters, so they added utf-8mb4, which is really utf-8.

Before running your actual query, do a mysql_query ('SET NAMES utf8mb4')
Also make sure your mysql server is configured to use utf8mb4 too. For more information on how, refer to article: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

Mysql: latin1-> utf8. Convert characters to their multibyte equivalents

There was a table in latin1 and site in cp1252
I want to have table in utf8 and site in utf-8
I've done:
1) on web page: Content-Type: text/html;charset=utf-8
2) Mysql: ALTER TABLE XXX CONVERT TO CHARACTER SET utf8
_
This SQL doesn't work as I want - it doesn't convert ä & ü characters in database to their multibyte equivalents
Please Help.
Tanks

As this blog post says, using MySQL's ALTER TABLE CONVERT syntax is A Bad Idea [TM]. Export your data, convert the table and then reimport the data, as described in the blog post.
Another idea: Have you set your default client connection charset via /etc/my.cnf or mysqli::set-charset .

I've been a fool. SET NAMES was missing.
What I know now:
1) Every time the charset of a column is changed, the actual data is ALWAYS recoded! Change field to binary to see that.
2) The charset of a column is prior!, the table and db charset follow in the priority. They are used mainly for setting defaults. (not 100% sure about last sentence)
3) SET NAMES is very important. German characters can come in latin1 and be placed get correctly in utf8 table(recoded by Mysql silently) when you SET NAMES correctly. The server can send data to a web page in the encoding you desire, no matter what the table encoding is. It can be recoded for output
4) If there is a column in encoding A and a column in encoding B, and you compare them (or use LIKE), the Mysql will silently convert them so that it looks like they are in one encoding
5) Mysql is smart. It never operates with text as with bytes unless the type is binary. It always operates as characters! He wants that ё in latin1 would equal ё in utf8 if he knows the data encoding

Since you claim you now get s**t back, it suggests that the characters were modified in the database.
How are you accessing the data in mysql? If you are using a programming interface such as PHP, then you may need to tell that interface what character encoding to expect.
For example, in PHP you will need to call something like mysql_set_charset("utf8"); but it can also be done with an SQL query of SET NAMES utf8
You will then also need to make sure that whatever is displaying the text knows it is utf8 and is rendering with an appropriate encoding. For example, on a web page you would need to set the content type to utf-8. something like Content-Type: text/html;charset=utf-8

Does MySQL collation type need to match PHP page charset type?

I have started debugging my RSS feed because it has some strange characters in it (i.e. the missing-character glyph). I started with two excellent beginner resources:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets: http://www.joelonsoftware.com/articles/Unicode.html
Character Sets / Character Encoding Issues: http://www.phpwact.org/php/i18n/charsets
The reason I believe our RSS feed is having problems is because users are copy&pasteing MS Word documents into a textarea on the site and our PHP pages are using the "iso-8859-1" charset which is incompatible with the special "Windows-1252" encodings for things like bullet points and smart quotes used by MS Word.
So I'm hoping to fix the issue, all I'll need to do is start using "utf-8" in the pages that take/give user input??. I.e. set the following in the HEAD section:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
The real reason I'm raising this question though, is because my DB fields that store my user input are in "latin1_swedish_ci" and I want to know whether I NEED to convert them to "utf8_general_ci"? MySQL doesn't really care about the charset does it? It just sees a bunch of bytes and If I put Unicode into a field collated as Latin it'll still come back out as Unicode right? Changing the field will be tiresome because the field is part of a FULLTEXT index where the other fields will also need their collation changing which means dropping the index and rebuilding it (which is no small task when there's large amounts of TEXT involved).

The real reason I'm raising this question though, is because my DB fields that store my user input are in "latin1_swedish_ci" and I want to know whether I NEED to convert them to "utf8_general_ci"?
No. latin1_swedish_ci and utf8_general_ci are collations - not charsets. The collation won't affect the way that characters are stored or input/output. It only controls how sorting functions order their results. The collation - to work as expected - should match the storage charset. So if your tables are stored in utf8, you should use a utf8 collation.
The storage charset for mysql is not directly tied to the charset in php. You can use utf8 as the storage characterset for Mysql, while using iso-8859-1 in php. In that case, you need to tell Mysql about it, by setting the charset on the connection (set names XXX). Mysql will then convert as needed. If you don't use the same charset on Mysql and php, you'll end up with the charset capacity that is the lowest dommon denominator, so even though strings are stored in utf8, you'll not have the full unicode range of characters available. Therefore you should use utf8 in both Mysql and php.

No - definitively not. As MySQL posseses the ability to transform strings from one character set into another on the fly, it's important though that your MySQL server knows what character set you're working with on the client side (client side = PHP script, NOT the client accessing your webpage). This can be done by issuing the query
SET NAMES 'utf8';
prior to any other query you send to the server. MySQL will then do the appropriate conversions from your client character set into the internal MySQL character set into the table and/or column character set and all the way back. So generally you only have to worry about setting the correct client character set. This character set must match the character set you use to output your data to the webserver.
Please have a look at the MySQL manual:
9.1.4. Connection Character Sets and Collations
or 9.1. Character Set Support in general.

To save someone some time searching for how to change the mysql connection charset nicely with pdo/mysql here's how i do it:
$dbc = new pdo('mysql:dbname=DBNAME;host=DBHOST', $user, $pw, array(PDO::MYSQL_ATTR_INIT_COMMAND => sprintf( "SET NAMES %s", $charset ) ) );

In HTTP the character encoding is declared by the charset parameter in the Content-Type header field of the HTTP response. Other declaration are overwritten by the declaration in the HTTP header:
[…] user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
The charset attribute set on an element that designates an external resource.
Additionally you should explicitly declare the accepted character encoding with the accept-charset attribute in the form element. Otherwise the user agent may take (but must not) the character encoding used in the form document to encode the input data:
The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.
This should give you the best chance that the incoming data is encoded correctly. But it’s not guarateed. So better check if the data is acutally encoded with UTF-8 (there are functions/algorithms to do this).

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.