Encoding troubles converting MySQL to Mongo with PHP - php

I've been having a lot of encoding troubles with PHP/Mongo in general.
Right now, I'm in the process of converting some data from MySQL to Mongo. I have a string that contains an é, but when I try to encode it to UTF-8 (via mb_convert_encoding, utf8_encode), it turns into Ã©. I'm sure other strings also contain other accented characters.
I've tried mb_detect_encoding, which told me the string is UTF-8, but when I do mb_check_encoding($string, 'UTF-8'), it returns false.
Basically, I have no idea what's wrong. This is on a page that is just a PHP script, no HTML. Any advice on this problem, or on maintaining character encoding in general when inserting into Mongo?
Here is the script in question: https://plnkr.co/edit/eAkLxfklzLNCsZTBPKsX
The MySQL table is using a MyISAM engine, charset utf8, collation utf8_unicode_ci

Do not use the mysql_* API; change to mysqli_*
Do not use any mb or utf8 encode/decode routines; they merely hide the 'proper' solution.
Right after connecting to mysql, do SET NAMES utf8.
SHOW CREATE TABLE -- verify that the table/columns are CHARACTER SET utf8 (or utf8mb4)
Ã© is the Mojibake for é. It usually indicates a mismatch of latin1 settings and utf8 settings.
If using PDO: $db = new PDO('mysql:host=host;dbname=db;charset=utf8', $user, $pwd); or execute SET NAMES utf8.
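The Ã© symptom can be reproduced in a few lines, which may help when diagnosing where the mismatch happens. This is only a sketch to illustrate the latin1/utf8 confusion, not a fix (the fix is the connection and table settings listed above); the variable names are made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// "é" as UTF-8 is the two bytes 0xC3 0xA9 (byte escapes are used so that
// the encoding of this file itself does not matter).
$utf8 = "\xC3\xA9";

// The same letter in latin1 is the single byte 0xE9, which on its own
// is not valid UTF-8. This is why mb_check_encoding() can return false.
$latin1 = "\xE9";
$latin1IsValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Treating UTF-8 bytes as latin1 and converting them "to UTF-8" again
// double-encodes them: C3 A9 becomes C3 83 C2 A9, which displays as "Ã©".
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), ' -> ', bin2hex($mojibake), PHP_EOL;
```

If the hex dump of a value read from MySQL already shows c3a9 for é, the data is UTF-8 and should not be encoded again.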

Related

The ultimate emoji encoding scheme

This is my environment: Client -> iOS App, Server ->PHP and MySQL.
The data from client to server is done via HTTP POST.
The data from server to client is done with json.
I would like to add support for emojis or any utf8mb4 character in general. I'm looking for the right way for dealing with this under my scenario.
My questions are the following:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols? If so, which encoding method should I use so that it works flawlessly in Objective-C and PHP (and java for the future android version)?
Right now I have the DB with utf8mb4 but I get errors when trying to store a raw emoji. On the other hand, I can store non-utf8 symbols such as ¿ or á.
When I retrieve these symbols in PHP I first need to execute SET CHARACTER SET utf8 (if I get them in utf8mb4 the json_decode function doesn't work), then such symbols are encoded (e.g., ¿ is encoded to \u00bf).
MySQL's utf8 charset is not actually UTF-8, it's a subset of UTF-8 only supporting the basic plane (characters up to U+FFFF). Most emoji use code points higher than U+FFFF. MySQL's utf8mb4 is actual UTF-8 which can encode all those code points. Outside of MySQL there's no such thing as "utf8mb4", there's just UTF-8. So:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
Again, no such thing as "utf8mb4". HTTP POST requests support any raw bytes, if your client sends UTF-8 encoded data you're fine.
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Yes.
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?
God no, use raw UTF-8 (utf8mb4) for all that is holy.
When I retrieve this symbols in PHP I first need to execute SET CHARACTER SET utf8
Well, there's your problem; channeling your data through MySQL's utf8 charset will discard any characters above U+FFFF. Use utf8mb4 all the way through MySQL.
if I get them in utf8mb4 the json_decode function doesn't work
You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:
echo json_encode('😀');
"\ud83d\ude00"
echo json_decode('"\ud83d\ude00"');
😀
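To narrow down a JSON problem like the one in the question, it can help to compare a valid UTF-8 string with an invalid one. The following is a minimal sketch; the variable names are illustrative. It relies on json_encode() returning false and json_last_error() reporting JSON_ERROR_UTF8 when the input is not valid UTF-8.

```php
<?php
// Grinning face emoji, U+1F600, a 4-byte sequence in UTF-8.
$emoji = "\u{1F600}";

// Default behaviour: non-ASCII characters are escaped (as a surrogate pair here).
$escaped = json_encode($emoji);

// JSON_UNESCAPED_UNICODE keeps the raw UTF-8 bytes in the output instead.
$raw = json_encode($emoji, JSON_UNESCAPED_UNICODE);

// Decoding the escaped form gives back the original UTF-8 string.
$roundTrip = json_decode($escaped);

// A lone latin1 byte (0xE9, "é" in latin1) is not valid UTF-8,
// so json_encode() fails rather than guessing an encoding.
$bad = json_encode("\xE9");
$badError = json_last_error();
```

If the JSON functions fail on data coming from MySQL, checking json_last_error() right after the call usually shows whether the bytes are really UTF-8.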
Use utf8mb4 throughout MySQL:
SET NAMES utf8mb4
Declare the table/columns CHARACTER SET utf8mb4
Emoji and certain Chinese characters will work in utf8mb4, but not in MySQL's utf8.
Use UTF-8 throughout other things:
HTML: <meta charset=UTF-8>
¿ or á are (or at least can be) encoded in utf8 (utf8mb4)
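The difference between MySQL's utf8 and utf8mb4 comes down to whether a character needs 4 bytes in UTF-8. A small sketch of a check for that follows; needsUtf8mb4() is a hypothetical helper written for this illustration, not a PHP or MySQL function.

```php
<?php
/**
 * Returns true if the UTF-8 string contains a code point above U+FFFF,
 * i.e. a character that takes 4 bytes and does not fit in MySQL's 3-byte utf8.
 * Hypothetical helper for illustration only.
 */
function needsUtf8mb4(string $text): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$accents = "\u{00BF} or \u{00E1}";   // "¿ or á": 2 bytes each, fine in utf8
$emoji   = "\u{1F600}";              // 4 bytes, requires utf8mb4

$accentsNeedMb4 = needsUtf8mb4($accents);
$emojiNeedsMb4  = needsUtf8mb4($emoji);
```

This is only a diagnostic aid; the data itself should still be stored as raw UTF-8 in utf8mb4 columns.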

Having different server side and database charsets - CodeIgniter

I'm using a cryptographic function in PHP (mcrypt_create_iv). I saw that in my database table that the field which stores this functions return value is of the latin1_swedish_ci charset, while in CodeIgniter (config/database.php) the charset is set to utf8.
I tested keeping the charset as utf8 in CI and running the method which stores the encrypted data into the tables column, but it returned a bunch of question marks and stuff that didn't make me feel confident that the mcrypt function worked.
So I changed CIs database charset to latin1, which is the same as the field in my databases table. My DB config file now looks like:
$db['default']['char_set'] = 'latin1';
$db['default']['dbcollat'] = 'utf8_general_ci';
I was wondering if there would be any problem caused by using both latin1 and utf8? I can feel that it just doesn't look right, using two different charsets and all, but in order to use the mcrypt_create_iv function (which is used to salt passwords, a big deal imo), I resorted to doing it anyway, hoping it wouldn't affect anything (i.e. inserting/getting data back correctly).
Could someone please shed some light, I would really appreciate it. Thanks
Using charset latin but UTF collation doesn't make a lot of sense. The latin charset will turn most unicode characters into "?" since they don't exist in the indicated charset. Using collation based on characters that are not in your chosen charset won't do anything.
So: if you want to be able to store all textual data, you'll want to change your charset to utf8, and use utf8_general_ci collation. If you just want latin1 exclusively (I don't know why you would, but you might...) then use collation rules for latin1 as well.
If you do go with utf8, you'll also want to remember, when you set up a connection to your database, to ensure the connection also uses utf8 for its charset and names, so that you don't lose text "in transport" between your server and your database.
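As a rough way to spot the kind of mismatch in the question: MySQL collation names start with the name of the charset they belong to (utf8_general_ci, latin1_swedish_ci), so a simple prefix check catches a mixed configuration. This is a sketch; collationMatchesCharset() is a hypothetical helper, and the arrays only mirror the two CodeIgniter keys shown in the question.

```php
<?php
/**
 * Checks that a collation name belongs to the given charset by comparing
 * the name prefix. Hypothetical helper for illustration only.
 */
function collationMatchesCharset(string $charset, string $collation): bool
{
    return strpos($collation, $charset . '_') === 0;
}

// The mixed setup from the question: latin1 charset with a utf8 collation.
$mixed = ['char_set' => 'latin1', 'dbcollat' => 'utf8_general_ci'];

// A consistent utf8 setup, along the lines suggested above.
$consistent = ['char_set' => 'utf8', 'dbcollat' => 'utf8_general_ci'];

$mixedOk      = collationMatchesCharset($mixed['char_set'], $mixed['dbcollat']);
$consistentOk = collationMatchesCharset($consistent['char_set'], $consistent['dbcollat']);
```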

Using utf8mb4 with php and mysql

I have read that mysql >= 5.5.3 fully supports every possible character if you USE the encoding utf8mb4 for a certain table/column http://mathiasbynens.be/notes/mysql-utf8mb4
Looks nice. Only I noticed that the mb_* functions in PHP do not! I cannot find utf8mb4 anywhere in the list: http://php.net/manual/en/mbstring.supported-encodings.php
Not only have I read things but I also made a test.
I have added data to a mysql utf8mb4 table using a php script where the internal encoding was set to UTF-8: mb_internal_encoding("UTF-8");
and, as expected, the characters look messy once in the db.
Any idea how I can make php and mysql talk the same encoding (possibly a 4 bytes one) and still have FULL support to any world language?
Also why is utf8mb4 different from utf32?
MySQL's utf8 encoding is not actual UTF-8. It's an encoding that is kinda like UTF-8, but only supports a subset of what UTF-8 supports. utf8mb4 is actual UTF-8. This difference is an internal implementation detail of MySQL. Both look like UTF-8 on the PHP side. Whether you use utf8 or utf8mb4, PHP will get valid UTF-8 in both cases.
What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4. If it's set to utf8, MySQL will not support all characters. You set this connection encoding using mysql_set_charset(), the PDO charset DSN connection parameter or whatever other method is appropriate for your database API of choice.
mb_internal_encoding just sets the default value for the $encoding parameter all mb_* functions have. It has nothing to do with MySQL.
UTF-8 and UTF-32 differ in how they encode characters. UTF-8 uses a minimum of 1 byte for a character and a maximum of 4. UTF-32 always uses 4 bytes for every character. UTF-16 uses a minimum of 2 bytes and a maximum of 4.
Due to its variable length, UTF-8 has a little bit of overhead. A character which can be encoded in 2 bytes in UTF-16 may take 3 in UTF-8; on the other hand, UTF-16 never uses fewer than 2 bytes. If you're storing lots of Asian text, UTF-16 may use less storage. If most of your text is English/ASCII, UTF-8 uses less storage. UTF-32 always uses the most storage.
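The byte counts are easy to check from PHP. A minimal sketch, assuming the mbstring extension is available; encodedLengths() is a made-up helper name. The big-endian variants are used so that no byte order mark is added to the output.

```php
<?php
/**
 * Returns the number of bytes a UTF-8 string occupies in UTF-8, UTF-16 and UTF-32.
 * Hypothetical helper for illustration only.
 */
function encodedLengths(string $utf8): array
{
    return [
        'UTF-8'  => strlen($utf8),
        'UTF-16' => strlen(mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8')),
        'UTF-32' => strlen(mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8')),
    ];
}

$ascii = encodedLengths('A');          // plain ASCII letter
$euro  = encodedLengths("\u{20AC}");   // euro sign, inside the basic plane
$emoji = encodedLengths("\u{1F600}");  // emoji, above U+FFFF
```

For these three inputs the UTF-8 lengths are 1, 3 and 4 bytes, the UTF-16 lengths are 2, 2 and 4 bytes, and UTF-32 is always 4 bytes.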
This is what I used, and it worked well for my problem with the euro sign (€) and json_encode failing on conversion.
PHP configuration script (API etc.):
header('Content-Type: text/html; charset=utf-8');
ini_set("default_charset", "UTF-8");
mb_internal_encoding("UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
mysql tables / or specific columns
utf8mb4
mysql PDO connection
$dsn = 'mysql:host=yourip;dbname=XYZ;charset=utf8mb4';
(...your connection ...)
before execute query (might not be required):
$dbh->exec("set names utf8mb4");
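Putting the connection part together, a sketch of building the DSN with the charset set explicitly could look like this. buildMysqlDsn() is a hypothetical helper, and the host and database names are placeholders; the charset parameter in the DSN is honoured by PDO's MySQL driver from PHP 5.3.6 onwards. The actual connection is left as a comment because it needs a reachable server and credentials.

```php
<?php
/**
 * Builds a PDO MySQL DSN with an explicit connection charset.
 * Hypothetical helper for illustration only.
 */
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'XYZ');

// With a real server and credentials (placeholders here), the connection would be:
// $dbh = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
```

With the charset in the DSN, the extra SET NAMES query is normally redundant, which matches the "might not be required" note above.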
utf-32: This is a character encoding using a fixed 4 bytes per character.
utf-8: This is a character encoding using up to 4 bytes per character, but the most frequent characters are coded on only 1, 2 or 3 bytes.
MySQL's utf8 doesn't support characters coded on more than 3 bytes, so they added utf8mb4, which is really UTF-8.
Before running your actual query, do a mysql_query ('SET NAMES utf8mb4')
Also make sure your mysql server is configured to use utf8mb4 too. For more information on how, refer to article: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

What should I update in my PHP setup when I change my MySQL databases to UTF-8 encoding?

I currently operate a website on a PHP 5 and MySQL backbone. The MySQL databases use cp1252 West European (latin1) encoding, and latin1_swedish_ci collation.
I'd like to switch the MySQL databases to UTF-8 encoding and utf8_general_ci. I don't need help converting the content within MySQL as I'm processing that as it goes in and redoing all the content on the site. Assume I'm doing that correctly for this conversation ( even though I'm probably not ).
I know there are settings in php.ini like default_charset that default to iso-8859-1. I also know that many of PHP's string manipulation functions like strlen(), as well as regexes, will not work correctly if I'm dealing with strings that contain multi-byte UTF-8 characters, which I realize is not all characters in the UTF-8 set.
What do I need to do to PHP server side and within my webapp to deal with UTF-8 coming out of my database? What does it all do?
You will have to set up your DB connection with:
mysql_query("SET NAMES 'utf8'");
And then replace your "regular" string functions with those from the mbstring module:
http://php.net/manual/en/book.mbstring.php
like mb_strlen, mb_substr, etc.
As well as specify UTF-8 encoding where needed, for instance in the htmlentities function :
echo htmlentities($str, ENT_QUOTES, "UTF-8");
See the documentation for htmlentities.
Also, you should save all your files with utf-8 encoding (preferably without BOM).
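To illustrate why the byte-oriented functions need replacing, here is a short sketch comparing them with their mbstring counterparts on a string containing one multi-byte character. The variable names are illustrative and the mbstring extension is assumed to be loaded.

```php
<?php
// "héllo": the é (U+00E9) takes two bytes in UTF-8.
$word = "h\u{00E9}llo";

$byteCount = strlen($word);                // counts bytes
$charCount = mb_strlen($word, 'UTF-8');    // counts characters

// substr() cuts on byte boundaries and can split é in half,
// leaving a string that is no longer valid UTF-8.
$brokenCut = substr($word, 0, 2);
$brokenCutIsValid = mb_check_encoding($brokenCut, 'UTF-8');

// mb_substr() cuts on character boundaries.
$cleanCut = mb_substr($word, 0, 2, 'UTF-8');

// htmlentities() needs to be told the input encoding explicitly.
$escaped = htmlentities($word, ENT_QUOTES, 'UTF-8');
```

Here strlen() reports 6 while mb_strlen() reports 5, and only the mb_substr() result is still a valid UTF-8 string.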

ext/mysql charset support vs ext/mysqli charset

I read some articles that promoted the use of the new ext/mysqli in PHP due to its support of character sets. I currently use ext/mysql and use SET NAMES UTF-8 to ensure all my data is stored as utf-8. Isn't that charset support in ext/mysql, or am I missing something larger?
Thanks :)
SET NAMES UTF-8 does not mean the data is stored in UTF-8 (and MySQL spells the charset name utf8, so the statement is SET NAMES utf8). It means that data is RECEIVED in UTF-8 from the client and is SERVED in UTF-8 to the client.
Storage encoding is set when you create a db/table/column, for example
CREATE TABLE my_table (
...
) DEFAULT CHARSET=utf8;
or
CREATE DATABASE my_db DEFAULT CHARACTER SET utf8;
Read here: Mysql: latin1-> utf8. Convert characters to their multibyte equivalents
To Lyon:
ext/mysql works just fine.
Please check once more the encodings of the tables and columns via, for example, phpMyAdmin. Remember that setting the encoding on a database doesn't automatically change the encoding of its tables. It's just used as a default value if the table encoding is not specified.
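The same check can be done without phpMyAdmin by looking at the output of SHOW CREATE TABLE. The sketch below only parses such output in PHP; both function names are hypothetical, and the sample statement is made up to show a table whose default charset differs from one of its columns.

```php
<?php
/**
 * Extracts the table-level default charset from SHOW CREATE TABLE output.
 * Hypothetical helper for illustration only.
 */
function tableDefaultCharset(string $createTableSql): ?string
{
    if (preg_match('/\bDEFAULT CHARSET=(\w+)/i', $createTableSql, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

/**
 * Lists columns that declare their own CHARACTER SET, as column => charset.
 * Hypothetical helper for illustration only.
 */
function columnCharsets(string $createTableSql): array
{
    preg_match_all('/`(\w+)`[^,\n]*?\bCHARACTER SET (\w+)/i', $createTableSql, $m);
    return array_combine($m[1], array_map('strtolower', $m[2]));
}

// Made-up sample in the shape MySQL prints for SHOW CREATE TABLE.
$sample = "CREATE TABLE `t` (\n"
    . "  `id` int(11) NOT NULL,\n"
    . "  `name` varchar(50) CHARACTER SET latin1 DEFAULT NULL\n"
    . ") ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci";

$tableCharset = tableDefaultCharset($sample);
$perColumn    = columnCharsets($sample);
```

In this sample the table default is utf8 while the name column is still latin1, which is the kind of leftover that changing only the database or table default does not fix.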
