How do I insert UCS-2 data with PHP PDO into MySQL?

How do I insert UCS-2 data with PHP PDO into MySQL? - php

The manual clearly states " ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET". So how can I insert, for example, the codepoint U+2193? I am using PHP 5.3 + PDO.

If you want to use Unicode for communicating with a MySQL server, your only option is to use UTF-8.
If you're working with UCS-2 or UTF-16 strings in PHP now, you'll have to convert them to UTF-8 before trying to store them. Also note that MySQL will give you back UTF-8 if that's what you set your client character set to, so you'll need to convert query results as well if you're committed to working with UCS-2 on the PHP side. (If you're in a position to make bigger changes, you'd likely be better off simply using UTF-8 everywhere than doing all this extra conversion.)
As for storing the codepoint U+2193, no worries: UTF-8 can represent every Unicode codepoint (in this specific case, it'd be 0xE2 0x86 0x93).
Technically, this is fudging a little, since MySQL's utf8 and ucs2 character sets only cover a subset of Unicode called the Basic Multilingual Plane (BMP). The world of Unicode charsets is expanded in MySQL 5.5 to move beyond the BMP, but you still can't use ucs2, the new utf16 or utf32 charsets as client charsets, leaving you still stuck with UTF-8.

For posterity, CREATE TABLE test (encoding varchar(255) CHARACTER SET ucs2); and then INSERT INTO test VALUES (1, CHAR(0x2193));. If I then run a SELECT * FROM test I see a down arrow.

Related

Special characters in mySQL (and php) - THE BASICS

I am confused! Recently my webhotel updated php and now my old tables render special characters differently (wrongly).
Both my tables and my input/output-php-pages are set to utf-8 and since this update, also the inputs from php are treated differently; now my special characters are being utf-8-encoded as they enter the database. So since this change, when I review tables within phpMyAdmin, the old inserts have the original (non-encoded) special characters - the new posts have utf-8-encoded charcters (also special).
So what I would like to do is rewrite input and output to insert and show non-encoded characters - but I am not sure if this is possible without skipping utf-8 entirely (in php and mySQL). But is there an utf-8- way to submit non-encoded characters?
AND - perhaps more fundamentally - I need to understand what the possible downsides are. I am using Danish characters in and out and I'm not going to use any other language (for this project). So if it IS possible to insert and output non-encoded characters using utf-8 - am I then going to have unexpected/destructive issues?
I have read a lot of posts regarding php/mySQL/special characters but I haven't seen this angle on the issue yet. Hope I am not duplicating
I hope not because it has been working very nicely until the update.

Even if you are using only Danish characters, you may as well go utf8 all the way.
There are many places where the encoding needs to be stated:
The at the top of the html
The columns in the database (column CHARACTER SET defaults from table, which defaults from database)
The encoding in your PHP code.
When you CREATE TABLE, tack on DEFAULT CHARACTER SET utf8. If you have existing tables, without that, speak up; we may need to deal with them.
If you want Danish collation, the specify COLLATION utf8_danish_ci, too. Then (if I recall correctly), aa will sort after z.
(The default is utf8_general_ci, which won't do that sorting.)
Figure out what encoding you have (or can get) in your php code. If you have some text with accents in it, do this:
$hex = unpack('H*', $text);
echo implode('', $hex)
If you have utf8, å will be C3A5, for latin1 it will be E5.
Regardless of what encoding in in the tables, you must call set_charset('utf8') or set_charset('latin1') depending on what encoding is in the data in PHP. MySQL will gladly transcode between latin1 and utf8 as things are passed between PHP and MySQL. For different APIs:
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
For much more info, see http://mysql.rjweb.org/doc.php/charcoll .

Emoji Support in PHP & MySQL

I've been asked to enable Emoji support for an APP backed by a PHP API. The APP is currently iPhone only (i don't have one, but i'm assuming it has Emoji's on it?).
Anyway, i noticed the database for some reason uses latin_swedish everywhere. But since i wasn't sure if utf-8 could support the 4 byte character strings required for the full emoji range, i started googling, but couldn't realy get a full answer from the results.
So:
To support Emoji's, do the charset's/collation's need setting to utf-8 in mysql, or utf-8 mb4?
If charset needs setting to utf8mb4, what is the difference between utf8 and utf8mb4 (utf8 supports up to 4 bytes anyway doesnt it?). Does it force characters to be stored in 4 byte representations at a fixed width (assuming requiring 4x more storage space per chatacter even on the standard ascii range which would normally be 1 byte).
Can utf8 be compared to utf8mb4 in mysql queries? What if i try to do a full text search, or a where clause on a utf8mb4 charset against a utf8 column of another table?
Does PHP support 4byte strings without having to use a special library like mb_string? i.e. can i just assign $var = $_POST['text'] and do things like $emoji_var == 'xxxx' or do i have to literally change all strings in PHP to use mbstring and change all comparitors e.c.t.
Just trying to work out how much work is involved in having emoji support, and any caveats of doing so. So any help would be great.

Using utf8mb4 with php and mysql

I have read that mysql >= 5.5.3 fully supports every possible character if you USE the encoding utf8mb4 for a certain table/column http://mathiasbynens.be/notes/mysql-utf8mb4
looks nice. Only I noticed that the mb_functions in php does not! I cannot find it anywhere in the list: http://php.net/manual/en/mbstring.supported-encodings.php
Not only have I read things but I also made a test.
I have added data to a mysql utf8mb4 table using a php script where the internal encoding was set to UTF-8: mb_internal_encoding("UTF-8");
and, as expected, the characters looks messy once in the db.
Any idea how I can make php and mysql talk the same encoding (possibly a 4 bytes one) and still have FULL support to any world language?
Also why is utf8mb4 different from utf32?

MySQL's utf8 encoding is not actual UTF-8. It's an encoding that is kinda like UTF-8, but only supports a subset of what UTF-8 supports. utf8mb4 is actual UTF-8. This difference is an internal implementation detail of MySQL. Both look like UTF-8 on the PHP side. Whether you use utf8 or utf8mb4, PHP will get valid UTF-8 in both cases.
What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4. If it's set to utf8, MySQL will not support all characters. You set this connection encoding using mysql_set_charset(), the PDO charset DSN connection parameter or whatever other method is appropriate for your database API of choice.
mb_internal_encoding just sets the default value for the $encoding parameter all mb_* functions have. It has nothing to do with MySQL.
UTF-8 and UTF-32 differ in how they encode characters. UTF-8 uses a minimum of 1 byte for a character and a maximum of 4. UTF-32 always uses 4 bytes for every character. UTF-16 uses a minimum of 2 bytes and a maximum of 4.
Due to its variable length, UTF-8 has a little bit of overhead. A character which can be encoded in 2 bytes in UTF-16 may take 3 or 4 in UTF-8; on the other hand, UTF-16 never uses less than 2 bytes. If you're storing lots of Asian text, UTF-16 may use less storage. If most of your text is English/ASCII, UTF-8 uses less storage. UTF-32 always uses the most storage.

This is what i used, and worked good for my problem using euro € sign and conversion for json_encode failure.
php configurations script( api etc..)
header('Content-Type: text/html; charset=utf-8');
ini_set("default_charset", "UTF-8");
mb_internal_encoding("UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
mysql tables / or specific columns
utf8mb4
mysql PDO connection
$dsn = 'mysql:host=yourip;dbname=XYZ;charset=utf8mb4';
(...your connection ...)
before execute query (might not be required):
$dbh->exec("set names utf8mb4");

utf-32: This is a character encoding using a fixed 4-bytes per characters
utf-8: This is a character encoding using up to 4 bytes per characters, but the most frequent characters are coded on only 1, 2 or 3 characters.
MySQL's utf-8 doesn't support characters coded on more than 3 characters, so they added utf-8mb4, which is really utf-8.

Before running your actual query, do a mysql_query ('SET NAMES utf8mb4')
Also make sure your mysql server is configured to use utf8mb4 too. For more information on how, refer to article: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

Character encoding MSSQL.. ISO -> Utf-8 -> Latin-1..need reversed

We are trying to migrate database content (with a PHP script).
Content has been copied into a CMS and then written to the database. Content copied could be from any character encoding scheme (e.g. IS0-...-14) and any website.
The PHP CMS is UTF-8 so the character pasted into a textbox would be converted to UTF-8 when it was POSTed but then written to the database as Latin-1 (MSSQL db...db charset and query charset both latin-1).
We are desperately trying to think up how this could be reversed or if it is even possible (to get it so the character is fully UTF-8) in PHP.
If we can get the logic we can write an extension in C++ if PHP cant handle it (which it probably cant, mb_shite and iconv).
I keep getting lost in UTF-8 4 byte character streams (i.e. 0-127 is..ect).
Anybody got any ideas?
So far we have used PHP's ord() function to try and produce a Unicode/Acsii char ref for each char (I know ord returns ASCII but it prints character numbers over 128 which I thought was wierd if it is just meant to be ASCII, or maybe it repeats itself).
My thoughts are the latin1 will struggle to convert back to UTF-8 and will result in black diamond due to single byte char stream in Latin1 (ISO-...-1).

If latin1 is an 8-bit clean encoding for your database (it is in MySQL, donno about MSSQL), then you don't need to do anything to reconstruct the utf-8 string. When you pull it out of your database into PHP you will get back the same bytes you put in, i.e. UTF-8.
If latin1 is not an 8-bit-clean encoding for your database then your strings are irretrievably broken. This means any characters which the database considered invalid were either dropped or replaced the moment you wrote your utf-8 string to the database. There isn't any way to recover from this.

Store special characters (german) SqlServer via php

I have a fedora machine acting as server, with apache running php 5.3
A scripts acts as an entry page for various sources sending me "messages".
The php script is called like: serverAddress/phpScript.php?message=MyMessage the message is then saved via PDO to connect to SqlServer 2008 db.
If the message contains any special characters (e.g. german), like: üäöß then in the db I will get some gibberish instead of the correct string: Ã¼Ã¤Ã¶ÃŸ
The db is perfectly capable of UTF-8 - I can connect and send/retrieve german characters without any issue with other tools (not via php).
Inside the php script:
if I echo the input string I get the correct string üäöß
if I save it to a file (log the input) I see the gibberish: Ã¼Ã¤Ã¶ÃŸ
What is causing this behavior? How can I fix it?
multibyte is enabled (yum install php-mbstring followed by a apache restart)
at the start of my php script I have:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
from what I understand the default encoding type when dealing with mssql via PDO is UTF-8
New development:
A colleague pointed me to the PDO_DBLIB page (visible only from cache in this moment) where I saw $res->bindValue(':value', iconv('UTF-8', 'ISO8859-1', $value);
I replaced all my $res->bindParam(':text',$text); with $res->bindParam(':text',iconv('UTF-8', 'ISO8859-1',$text)); and everything worked :).
The mb_internal_encoding.... and all other lines were no longer needed.
Why does it work when using the ISO8859-1 encoding?

A database may handle special characters without even supporting the Unicode set (which UTF-8 happens to be an encoding, specifically a variable-length one).
A character set is a mapping between numbers and characters. Unicode and ASCII are common examples of charsets. Unicode states that the sign € maps to the number 8364 (really it uses the code point U+20AC). UTF-8 is a way to encode Unicode code points, and represents U+20AC with three bytes: 0xE2 0x82 0xAC; UTF-16 is another encodind for Unicode code points, which always use two bytes: 0x20AC (link). Both of these encodings refer to the same 8364th entry in the Unicode catalogue.
ASCII is both a charset and an encoding scheme: the ASCII character set maps number from 0 to 127 to 128 human chars, and the ASCII encoding requires a single byte.
Always remember that a String is a human concept. It's represented in a computer by the tuple (byte_content, encoding). Let's say you want to store Unicode strings in your database. Please, note: it's not necessary to use the Unicode set if you just need to support German users. It's useful when you want to store Arabian, Chinese, Hebrew and German at the same time in the same column. MS SQLServer uses UCS-2 to encode Unicode, and this holds true for columns declared NCHAR or NVARCHAR (note the N prefix). So your first action will be checking if the target columns types are actually nvarchar (or nchar).
Then, let's assume that all input strings are UTF-8 encoded in your PHP script. You want to execute something like
$stmt->bindParam(':text', $utf8_encoded_text);
According to the documentation, UTF-8 is the default string encoding. I hope it's smart enough to work with NVARCHAR, otherwise you may need to use the extra options.
Your colleague's solution doesn't store Unicode strings: it converts in the ISO-8859-1 space, then saves the bytes in simple CHAR or VARCHAR columns. The difference is that you won't be able to store character outside of the ISO-8859-1 space (eg Polish)

Take a look at this article on "Handling Unicode Front to Back in a Web App". By far one of the best articles I've seen on the subject. If you follow the guide and the issues are still present, then you know for sure that it's not your fault.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.