Need help querying UTF8 strings from Vertica with PHP ODBC driver - php

I've been having some trouble figuring out the best way to handle UTF8 characters in PHP. I'm able to load UTF8 data (chinese characters) into Vertica just fine, and can see them there when using a JDBC client, so I know the data is being recorded correctly.
However, when I query via PHP, strings that contain UTF8 characters come through as nulls. However, I can do something like wrap the UTF8 field in a URI_PERCENT_ENCODE function, then do a urldecode on the data in PHP, which outputs the characters correctly.
Are there any ODBC driver settings, or PHP settings that you can recommend to handle UTF8 more gracefully?
We are running PHP 5.3, 64 bits.

For whatever it's worth, when working with the Vertica 64-bit ODBC for Windows and calling SQLDescribeColW to describe a table with Chinese name and Chinese column names (i.e. describing an SQL statement like 'select * from mytable'), the names returned encoded in "funky UTF-8".
The "funky UTF-8" or FUTF-8 encoding uses wchar_t[] (on Windows it is an array of 16-bit values) where in each entry in the array, there is a single real-UTF-8 byte.
For example, if the column name was "时髦" whose UTF-16 encoding is 65f6h,9ae6h (two characters, 16 bits each) and its UTF-8 encoding is e6h, 97h, b6h, e9h, abh, a6h (two characters, 3 bytes each) then in FUTF-8 you'd get: 00e6h, 0097h, 00b6h, 00e9h, 00abh, 00a6h (6 characters, 16 bits each).
I guess that this is what puts in null for PHP. I'd call it a bug of the ODBC driver.

Related

MySQL save emoji's to innoDB table [duplicate]

We are developing an iPhone app that would send emoticons from iPhone to server-side PHP and insert into MySQL tables. I am doing the server-side work.
But after insert statement executed successfully, the inserted value become blank.
What I could insert into the field(varchar) correctly is text, but once including emoticons,
just the text could be inserted and the emoticons would be cut automatically.
Someone give me advice about set the field type to Blog so that it could store image data.
But the inserted value is not always including emoticons case and size is small.
*I am using mysql_real_escape_string for inserting value.
Most iOS emojis use code points above the Basic Multilingual Plane of the Unicode table. For example, 😄 (SMILING FACE WITH OPEN MOUTH AND SMILING EYES) is at U+1F604.
Now, see http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html.
MySQL before version 5.5 only supports UTF-8 for the BMP, which includes characters between U+0000 and U+FFFF (i.e. only a subset of actual UTF-8; MySQL's utf8 is not real UTF-8). It cannot store the character at code point U+1F604 or other similar "high characters". MySQL 5.5+ supports utf8mb4 (actual UTF-8), utf16 and utf32, which are able to encode these characters. If you're using MySQL 5.5+, use one of these column character sets and make sure you're using the same charset for your connection encoding to/from PHP. If you are on MySQL < 5.5, you'll have to use a BLOB column type. That type stores raw bytes without caring about the "characters" in it. The downside is that you won't be able to efficiently search or index the text.
Some of the emoji characters work with older non-blobed mysql configurations because they are encoded using a 3 byte codepoint and mysql can store a 3 byte character. If you cannot upgrade mysql nor use blobs for whatever reason, you can scrub out 4 byte codepoints and keep the 3 byte ones.
If your computer has emoji capabilities, here is a list of the 3 byte iOS emoji characters:
☺❤✨❕❔✊✌✋☝☀☔☁⛄⚡☎➿✂⚽⚾⛳♠♥♣♦〽☕⛪⛺⛲⛵✈⛽⚠♨1⃣2⃣3⃣4⃣5⃣6⃣7⃣8⃣9⃣0⃣#⃣⬆⬇⬅➡↗↖↘↙◀▶⏪⏩♿㊙㊗✳✴♈♉♊♋♌♍♎♏♐♑♒♓⛎⭕❌©®™

DB2 iSeries PDO VARCHAR

I have successfully installed the ODBC iseries driver on a linux box. And I am calling into a DB2 iseries(6). Everything is running smoothly until I try to pull data from a column CDESC VARCHAR(3000). When the characters are below 255 I get no issues, but when it is over 255 the query fails and breaks the app. The data in the table is well over 255, but I just cannot pull it back out. I have tried CAST(CDESC AS TEXT) AS DESC, but this does not work. Any thoughts on the driver settings or changing the column type? Thanks in advance
VARCHARis a data type for single-byte character set data [SBCS], not double-byte [storesDBCS]. As such, it is impossible that it could store characters over 255.
If you need to support double byte characters, you might look at NVARCHAR which handles Unicode character sets.
Perhaps the issue is in translating to your character set. Remember that DB2 for i stores SBCS data in EBCDIC based character sets, not ASCII related ones. What CCSID's are you using on your end, and what is the data stored in?

Emoji Support in PHP & MySQL

I've been asked to enable Emoji support for an APP backed by a PHP API. The APP is currently iPhone only (i don't have one, but i'm assuming it has Emoji's on it?).
Anyway, i noticed the database for some reason uses latin_swedish everywhere. But since i wasn't sure if utf-8 could support the 4 byte character strings required for the full emoji range, i started googling, but couldn't realy get a full answer from the results.
So:
To support Emoji's, do the charset's/collation's need setting to utf-8 in mysql, or utf-8 mb4?
If charset needs setting to utf8mb4, what is the difference between utf8 and utf8mb4 (utf8 supports up to 4 bytes anyway doesnt it?). Does it force characters to be stored in 4 byte representations at a fixed width (assuming requiring 4x more storage space per chatacter even on the standard ascii range which would normally be 1 byte).
Can utf8 be compared to utf8mb4 in mysql queries? What if i try to do a full text search, or a where clause on a utf8mb4 charset against a utf8 column of another table?
Does PHP support 4byte strings without having to use a special library like mb_string? i.e. can i just assign $var = $_POST['text'] and do things like $emoji_var == 'xxxx' or do i have to literally change all strings in PHP to use mbstring and change all comparitors e.c.t.
Just trying to work out how much work is involved in having emoji support, and any caveats of doing so. So any help would be great.

Store special characters (german) SqlServer via php

I have a fedora machine acting as server, with apache running php 5.3
A scripts acts as an entry page for various sources sending me "messages".
The php script is called like: serverAddress/phpScript.php?message=MyMessage the message is then saved via PDO to connect to SqlServer 2008 db.
If the message contains any special characters (e.g. german), like: üäöß then in the db I will get some gibberish instead of the correct string: üäöß
The db is perfectly capable of UTF-8 - I can connect and send/retrieve german characters without any issue with other tools (not via php).
Inside the php script:
if I echo the input string I get the correct string üäöß
if I save it to a file (log the input) I see the gibberish: üäöß
What is causing this behavior? How can I fix it?
multibyte is enabled (yum install php-mbstring followed by a apache restart)
at the start of my php script I have:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
from what I understand the default encoding type when dealing with mssql via PDO is UTF-8
New development:
A colleague pointed me to the PDO_DBLIB page (visible only from cache in this moment) where I saw $res->bindValue(':value', iconv('UTF-8', 'ISO8859-1', $value);
I replaced all my $res->bindParam(':text',$text); with $res->bindParam(':text',iconv('UTF-8', 'ISO8859-1',$text)); and everything worked :).
The mb_internal_encoding.... and all other lines were no longer needed.
Why does it work when using the ISO8859-1 encoding?
A database may handle special characters without even supporting the Unicode set (which UTF-8 happens to be an encoding, specifically a variable-length one).
A character set is a mapping between numbers and characters. Unicode and ASCII are common examples of charsets. Unicode states that the sign € maps to the number 8364 (really it uses the code point U+20AC). UTF-8 is a way to encode Unicode code points, and represents U+20AC with three bytes: 0xE2 0x82 0xAC; UTF-16 is another encodind for Unicode code points, which always use two bytes: 0x20AC (link). Both of these encodings refer to the same 8364th entry in the Unicode catalogue.
ASCII is both a charset and an encoding scheme: the ASCII character set maps number from 0 to 127 to 128 human chars, and the ASCII encoding requires a single byte.
Always remember that a String is a human concept. It's represented in a computer by the tuple (byte_content, encoding). Let's say you want to store Unicode strings in your database. Please, note: it's not necessary to use the Unicode set if you just need to support German users. It's useful when you want to store Arabian, Chinese, Hebrew and German at the same time in the same column. MS SQLServer uses UCS-2 to encode Unicode, and this holds true for columns declared NCHAR or NVARCHAR (note the N prefix). So your first action will be checking if the target columns types are actually nvarchar (or nchar).
Then, let's assume that all input strings are UTF-8 encoded in your PHP script. You want to execute something like
$stmt->bindParam(':text', $utf8_encoded_text);
According to the documentation, UTF-8 is the default string encoding. I hope it's smart enough to work with NVARCHAR, otherwise you may need to use the extra options.
Your colleague's solution doesn't store Unicode strings: it converts in the ISO-8859-1 space, then saves the bytes in simple CHAR or VARCHAR columns. The difference is that you won't be able to store character outside of the ISO-8859-1 space (eg Polish)
Take a look at this article on "Handling Unicode Front to Back in a Web App". By far one of the best articles I've seen on the subject. If you follow the guide and the issues are still present, then you know for sure that it's not your fault.

iPhone emoticons insert into MySQL but become blank value

We are developing an iPhone app that would send emoticons from iPhone to server-side PHP and insert into MySQL tables. I am doing the server-side work.
But after insert statement executed successfully, the inserted value become blank.
What I could insert into the field(varchar) correctly is text, but once including emoticons,
just the text could be inserted and the emoticons would be cut automatically.
Someone give me advice about set the field type to Blog so that it could store image data.
But the inserted value is not always including emoticons case and size is small.
*I am using mysql_real_escape_string for inserting value.
Most iOS emojis use code points above the Basic Multilingual Plane of the Unicode table. For example, 😄 (SMILING FACE WITH OPEN MOUTH AND SMILING EYES) is at U+1F604.
Now, see http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html.
MySQL before version 5.5 only supports UTF-8 for the BMP, which includes characters between U+0000 and U+FFFF (i.e. only a subset of actual UTF-8; MySQL's utf8 is not real UTF-8). It cannot store the character at code point U+1F604 or other similar "high characters". MySQL 5.5+ supports utf8mb4 (actual UTF-8), utf16 and utf32, which are able to encode these characters. If you're using MySQL 5.5+, use one of these column character sets and make sure you're using the same charset for your connection encoding to/from PHP. If you are on MySQL < 5.5, you'll have to use a BLOB column type. That type stores raw bytes without caring about the "characters" in it. The downside is that you won't be able to efficiently search or index the text.
Some of the emoji characters work with older non-blobed mysql configurations because they are encoded using a 3 byte codepoint and mysql can store a 3 byte character. If you cannot upgrade mysql nor use blobs for whatever reason, you can scrub out 4 byte codepoints and keep the 3 byte ones.
If your computer has emoji capabilities, here is a list of the 3 byte iOS emoji characters:
☺❤✨❕❔✊✌✋☝☀☔☁⛄⚡☎➿✂⚽⚾⛳♠♥♣♦〽☕⛪⛺⛲⛵✈⛽⚠♨1⃣2⃣3⃣4⃣5⃣6⃣7⃣8⃣9⃣0⃣#⃣⬆⬇⬅➡↗↖↘↙◀▶⏪⏩♿㊙㊗✳✴♈♉♊♋♌♍♎♏♐♑♒♓⛎⭕❌©®™

Categories