Emoji Support in PHP & MySQL - php

I've been asked to enable Emoji support for an APP backed by a PHP API. The APP is currently iPhone only (i don't have one, but i'm assuming it has Emoji's on it?).
Anyway, i noticed the database for some reason uses latin_swedish everywhere. But since i wasn't sure if utf-8 could support the 4 byte character strings required for the full emoji range, i started googling, but couldn't realy get a full answer from the results.
So:
To support Emoji's, do the charset's/collation's need setting to utf-8 in mysql, or utf-8 mb4?
If charset needs setting to utf8mb4, what is the difference between utf8 and utf8mb4 (utf8 supports up to 4 bytes anyway doesnt it?). Does it force characters to be stored in 4 byte representations at a fixed width (assuming requiring 4x more storage space per chatacter even on the standard ascii range which would normally be 1 byte).
Can utf8 be compared to utf8mb4 in mysql queries? What if i try to do a full text search, or a where clause on a utf8mb4 charset against a utf8 column of another table?
Does PHP support 4byte strings without having to use a special library like mb_string? i.e. can i just assign $var = $_POST['text'] and do things like $emoji_var == 'xxxx' or do i have to literally change all strings in PHP to use mbstring and change all comparitors e.c.t.
Just trying to work out how much work is involved in having emoji support, and any caveats of doing so. So any help would be great.

Related

Handling mixed utf8 and utf8mb4 MYSQLI & PHP

Just now, I ran into a problem that I by mere chance had not encountered before:
In order to support emoji's in specific columns, I had decided to set my mysqli_set_charset() to utf8_mb4 and a few columns encoding within my database as well.
Now, I ran into the problem of PHP actually not correctly handling accented characters coming from normal utf8 encoded fields.
Now, I'm stuck with having mixed utf8 and utf8mb4 results. Since my data-handling is not very strong (used to work frameworks that handled it all for me) I'm quite unfamiliar with how I could best resolve this.
I have thought about the following options:
1 ) set my entire database to utf8mb4 collation instead of utf8 with a few exceptions.
2 ) use mysqli_set_charset() to change it, and simply make sure the queries getting said data are seperated
Now, neither of these seem like great ideas to me, but I can't really think of any better solution.
So then there's the remaining questions:
Will setting my entire db to utf8mb4 instead of utf8 be a big performance change? I do realise that utf8mb4 is bigger and therefore slower, which is why I tried to only use it on the columns in question in the first place.
Is there a way for me to simply have PHP handle utf8 encoding well, even when the mysqli_charset is onutf8mb4?
Do you have a better idea?
I'm at a real loss on this subject and I honestly can't guess which option is best. Googling on it didn't help too much as it only returned links explaining the differences of it or on how to convert your database to utf8mb4, so I would very much love to hear the thoughts on this of one of the wise SO colleagues!
Columns in this specific case:
My response including PHP's character encoding detection:
arri�n = UTF-8
bolsward = ASCII
go�nga = UTF-8
lo�nga = UTF-8
echt = ASCII
echteld = ASCII
echten (drenthe) = ASCII
echten (friesland) = ASCII
echtenerbrug = ASCII
echterbosch = ASCII
My MYSQLI charset:
mysqli_set_charset($this->getConn(), "utf8mb4");
-- and I just realised the problem was with my mysqli_set_charset. there indeed used to be an underscore in there...
It is spelled utf8mb4 (no underscore).
See Trouble with utf8 characters; what I see is not what I stored .
In particular, read "Overview of what you should do" in the answer.
You do not need to change the entire db. It is fine to specify utf8mb4 for only selected columns.
You do need to use utf8mb4 for the connection, but you specify 'UTF-8', which is the outside world's equivalent of MySQL's utf8mb4. MySQL's utf8 is a subset of utf8mb4. (Note: I am being precise in use of hyphens and underscores.)
utf8mb4 is not bigger, nor slower for transferring characters that are in common between utf8mb4 and the utf8 subset. Emoji are 4 bytes, so they are bigger than most other characters, but you are stuck with them being 4 bytes; don't sweat it.

Need help querying UTF8 strings from Vertica with PHP ODBC driver

I've been having some trouble figuring out the best way to handle UTF8 characters in PHP. I'm able to load UTF8 data (chinese characters) into Vertica just fine, and can see them there when using a JDBC client, so I know the data is being recorded correctly.
However, when I query via PHP, strings that contain UTF8 characters come through as nulls. However, I can do something like wrap the UTF8 field in a URI_PERCENT_ENCODE function, then do a urldecode on the data in PHP, which outputs the characters correctly.
Are there any ODBC driver settings, or PHP settings that you can recommend to handle UTF8 more gracefully?
We are running PHP 5.3, 64 bits.
For whatever it's worth, when working with the Vertica 64-bit ODBC for Windows and calling SQLDescribeColW to describe a table with Chinese name and Chinese column names (i.e. describing an SQL statement like 'select * from mytable'), the names returned encoded in "funky UTF-8".
The "funky UTF-8" or FUTF-8 encoding uses wchar_t[] (on Windows it is an array of 16-bit values) where in each entry in the array, there is a single real-UTF-8 byte.
For example, if the column name was "时髦" whose UTF-16 encoding is 65f6h,9ae6h (two characters, 16 bits each) and its UTF-8 encoding is e6h, 97h, b6h, e9h, abh, a6h (two characters, 3 bytes each) then in FUTF-8 you'd get: 00e6h, 0097h, 00b6h, 00e9h, 00abh, 00a6h (6 characters, 16 bits each).
I guess that this is what puts in null for PHP. I'd call it a bug of the ODBC driver.

Using utf8mb4 with php and mysql

I have read that mysql >= 5.5.3 fully supports every possible character if you USE the encoding utf8mb4 for a certain table/column http://mathiasbynens.be/notes/mysql-utf8mb4
looks nice. Only I noticed that the mb_functions in php does not! I cannot find it anywhere in the list: http://php.net/manual/en/mbstring.supported-encodings.php
Not only have I read things but I also made a test.
I have added data to a mysql utf8mb4 table using a php script where the internal encoding was set to UTF-8: mb_internal_encoding("UTF-8");
and, as expected, the characters looks messy once in the db.
Any idea how I can make php and mysql talk the same encoding (possibly a 4 bytes one) and still have FULL support to any world language?
Also why is utf8mb4 different from utf32?
MySQL's utf8 encoding is not actual UTF-8. It's an encoding that is kinda like UTF-8, but only supports a subset of what UTF-8 supports. utf8mb4 is actual UTF-8. This difference is an internal implementation detail of MySQL. Both look like UTF-8 on the PHP side. Whether you use utf8 or utf8mb4, PHP will get valid UTF-8 in both cases.
What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4. If it's set to utf8, MySQL will not support all characters. You set this connection encoding using mysql_set_charset(), the PDO charset DSN connection parameter or whatever other method is appropriate for your database API of choice.
mb_internal_encoding just sets the default value for the $encoding parameter all mb_* functions have. It has nothing to do with MySQL.
UTF-8 and UTF-32 differ in how they encode characters. UTF-8 uses a minimum of 1 byte for a character and a maximum of 4. UTF-32 always uses 4 bytes for every character. UTF-16 uses a minimum of 2 bytes and a maximum of 4.
Due to its variable length, UTF-8 has a little bit of overhead. A character which can be encoded in 2 bytes in UTF-16 may take 3 or 4 in UTF-8; on the other hand, UTF-16 never uses less than 2 bytes. If you're storing lots of Asian text, UTF-16 may use less storage. If most of your text is English/ASCII, UTF-8 uses less storage. UTF-32 always uses the most storage.
This is what i used, and worked good for my problem using euro € sign and conversion for json_encode failure.
php configurations script( api etc..)
header('Content-Type: text/html; charset=utf-8');
ini_set("default_charset", "UTF-8");
mb_internal_encoding("UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
mysql tables / or specific columns
utf8mb4
mysql PDO connection
$dsn = 'mysql:host=yourip;dbname=XYZ;charset=utf8mb4';
(...your connection ...)
before execute query (might not be required):
$dbh->exec("set names utf8mb4");
utf-32: This is a character encoding using a fixed 4-bytes per characters
utf-8: This is a character encoding using up to 4 bytes per characters, but the most frequent characters are coded on only 1, 2 or 3 characters.
MySQL's utf-8 doesn't support characters coded on more than 3 characters, so they added utf-8mb4, which is really utf-8.
Before running your actual query, do a mysql_query ('SET NAMES utf8mb4')
Also make sure your mysql server is configured to use utf8mb4 too. For more information on how, refer to article: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

Store special characters (german) SqlServer via php

I have a fedora machine acting as server, with apache running php 5.3
A scripts acts as an entry page for various sources sending me "messages".
The php script is called like: serverAddress/phpScript.php?message=MyMessage the message is then saved via PDO to connect to SqlServer 2008 db.
If the message contains any special characters (e.g. german), like: üäöß then in the db I will get some gibberish instead of the correct string: üäöß
The db is perfectly capable of UTF-8 - I can connect and send/retrieve german characters without any issue with other tools (not via php).
Inside the php script:
if I echo the input string I get the correct string üäöß
if I save it to a file (log the input) I see the gibberish: üäöß
What is causing this behavior? How can I fix it?
multibyte is enabled (yum install php-mbstring followed by a apache restart)
at the start of my php script I have:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
from what I understand the default encoding type when dealing with mssql via PDO is UTF-8
New development:
A colleague pointed me to the PDO_DBLIB page (visible only from cache in this moment) where I saw $res->bindValue(':value', iconv('UTF-8', 'ISO8859-1', $value);
I replaced all my $res->bindParam(':text',$text); with $res->bindParam(':text',iconv('UTF-8', 'ISO8859-1',$text)); and everything worked :).
The mb_internal_encoding.... and all other lines were no longer needed.
Why does it work when using the ISO8859-1 encoding?
A database may handle special characters without even supporting the Unicode set (which UTF-8 happens to be an encoding, specifically a variable-length one).
A character set is a mapping between numbers and characters. Unicode and ASCII are common examples of charsets. Unicode states that the sign € maps to the number 8364 (really it uses the code point U+20AC). UTF-8 is a way to encode Unicode code points, and represents U+20AC with three bytes: 0xE2 0x82 0xAC; UTF-16 is another encodind for Unicode code points, which always use two bytes: 0x20AC (link). Both of these encodings refer to the same 8364th entry in the Unicode catalogue.
ASCII is both a charset and an encoding scheme: the ASCII character set maps number from 0 to 127 to 128 human chars, and the ASCII encoding requires a single byte.
Always remember that a String is a human concept. It's represented in a computer by the tuple (byte_content, encoding). Let's say you want to store Unicode strings in your database. Please, note: it's not necessary to use the Unicode set if you just need to support German users. It's useful when you want to store Arabian, Chinese, Hebrew and German at the same time in the same column. MS SQLServer uses UCS-2 to encode Unicode, and this holds true for columns declared NCHAR or NVARCHAR (note the N prefix). So your first action will be checking if the target columns types are actually nvarchar (or nchar).
Then, let's assume that all input strings are UTF-8 encoded in your PHP script. You want to execute something like
$stmt->bindParam(':text', $utf8_encoded_text);
According to the documentation, UTF-8 is the default string encoding. I hope it's smart enough to work with NVARCHAR, otherwise you may need to use the extra options.
Your colleague's solution doesn't store Unicode strings: it converts in the ISO-8859-1 space, then saves the bytes in simple CHAR or VARCHAR columns. The difference is that you won't be able to store character outside of the ISO-8859-1 space (eg Polish)
Take a look at this article on "Handling Unicode Front to Back in a Web App". By far one of the best articles I've seen on the subject. If you follow the guide and the issues are still present, then you know for sure that it's not your fault.

How do I insert UCS-2 data with PHP PDO into MySQL?

The manual clearly states " ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET". So how can I insert, for example, the codepoint U+2193? I am using PHP 5.3 + PDO.
If you want to use Unicode for communicating with a MySQL server, your only option is to use UTF-8.
If you're working with UCS-2 or UTF-16 strings in PHP now, you'll have to convert them to UTF-8 before trying to store them. Also note that MySQL will give you back UTF-8 if that's what you set your client character set to, so you'll need to convert query results as well if you're committed to working with UCS-2 on the PHP side. (If you're in a position to make bigger changes, you'd likely be better off simply using UTF-8 everywhere than doing all this extra conversion.)
As for storing the codepoint U+2193, no worries: UTF-8 can represent every Unicode codepoint (in this specific case, it'd be 0xE2 0x86 0x93).
Technically, this is fudging a little, since MySQL's utf8 and ucs2 character sets only cover a subset of Unicode called the Basic Multilingual Plane (BMP). The world of Unicode charsets is expanded in MySQL 5.5 to move beyond the BMP, but you still can't use ucs2, the new utf16 or utf32 charsets as client charsets, leaving you still stuck with UTF-8.
For posterity, CREATE TABLE test (encoding varchar(255) CHARACTER SET ucs2); and then INSERT INTO test VALUES (1, CHAR(0x2193));. If I then run a SELECT * FROM test I see a down arrow.

Categories