PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding

PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding - php

I'm migrating a db from mysql to postgresql. The mysql db's default collation is UTF8, postgres is also using UTF8, and I'm encoding the data with pg_escape_string(). For whatever reason however, I'm running into some funky errors about bad encoding:
pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client"
I've been poking around trying to figure this out, and noticed that php is doing something weird; if a string has only ascii chars in it (eg. "hello"), the encoding is ASCII. If the string contains any non ascii chars, it says the encoding is UTF8 (eg. "Hëllo").
When I use utf8_encode() on strings that are already UTF8, it kills the special chars and makes them all messed up, so.. what can I do to get this to work?
(the exact char hanging it up right now is "�", but instead of just search/replace, i'd like to find a better solution so this kinda problem doesn't happen again)

Most likely, the data in your MySQL database isn't UTF8. It's a pretty common scenario. MySQL at least used to not do any proper validation at all on the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not, I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load it, and thus it may fail.
You may want to feed the data through something like iconv that can be set to ignore unknown characters, or transform them to "best guess".

BTW, an ASCII string is exactly the same in UTF-8 because they share the same first 127 characters; so "Hello" in ASCII is exactly the same as "Hello" in UTF-8, there's no conversion needed.
The collation in the table may be UTF-8 but you may not be fetching information from it in the same encoding. Now if you have trouble with information you give to pg_escape_string it's probably because you're assuming content fetched from MySQL is encoded in UTF-8 while it's not. I suggest you look at this page on MySQL documentation and see the encoding of your connection; you're probably fetching from a table where the collation is UTF-8 but you're connection is something like Latin-1 (where special characters such as çéèêöà etc won't be encoded in UTF-8).

Related

Charsets and Databases

My boss likes to use n-dashes. They always cause problems with encoding and I cannot work out why.
I store my TEXT field in a database under the charset: utf8_general_ci.
I have the following tags under my <head> on my webpage:
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I pull the information from my database with the following set:
mysql_set_charset('UTF8',$connection);
(I know MYSQL is depreciated)
But when I get information from the database, I end up with this:
Ã¢Â€Â“ Europe
If I take this string and run it through utf8_decode, I get this:
â�?�? Europe
I even tried running it thorugh utf8_encode, and I got this:
ÃÂ¢Ãâ¬Ãâ Europe
Can someone explain to me why this is happening? I dont understand. I even ran the string through mb_detect_encoding and It said the string was utf8. So why is not printing correctly?
The solution (or not really a solution, because it ruins the rest of the website) is to remove the mysql_set_encoding line, and use utf8_decode. Then it prints out fine. BUT WHY!?

You have to remember that computers handle all forms of data as nothing more than sequences of 1s and 0s. In order to turn those 1s and 0s into something meaningful, the computer must somehow be told how those bits should be interpreted.
When it comes to a textual string, such information regarding its bits' interpretation is known as its character encoding. For example, the bit sequence 111000101000000010010011, which for brevity I will express in hexadecimal notation as 0xe28093, is interpreted under the UTF-8 character encoding to be your boss's much-loved U+2013 (EN-DASH); however that same sequence of bits could mean absolutely anything under a different encoding: indeed, under the ISO-8859-1 encoding (for example), it represents a sequence of three characters: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>) and U+0093 (SET TRANSMIT STATE).
Unfortunately, in their infinite wisdom, PHP's developers decided not to keep track of the encoding under which your string variables are stored—that is left to you, the application developer. Worse still, many PHP functions make arbitrary assumptions about the encoding of your variables, and they happily go ahead manipulating your bits without any thought of the consequences.
So, when you call utf8_decode on a string: it takes whatever bits you provide, works out what characters they happen to represent in UTF-8, and then returns to you those same characters encoded in ISO-8859-1. It's entirely possible to come up with an input sequence that, when passed to this function, produces absolutely any given result; indeed, if you provide as input 0xc3a2c280c293 (which happens to be the UTF-8 encoding of the three characters mentioned above), it will produce a result of 0xe28093—the UTF-8 encoding of an "en dash"!
Such double encoding (i.e. UTF-8 encoded, treated as ISO-8859-1 and transcoded to UTF-8) appears to be what you're retrieving from MySQL when you do not call mysql_set_charset (in such circumstances, MySQL transcodes results to whatever character set the client specifies upon connection—the standard drivers use latin1 unless you override their default configuration). In order for a result that MySQL transcodes to latin1 to produce such double encoded UTF-8, the value that is actually stored in your column must have been triple encoded (i.e. UTF-8 encoded, treated as ISO-8859-1, transcoded to UTF-8, then treated as latin1 again)!
You need to fix the data that is stored in your database:
Identify exactly how the incumbent data has actually been encoded. Some values may well be triple-encoded as described above, but others (perhaps that predate particular changes to your application code; or that were inserted/updated from a different source) may be encoded in some other way. I find SELECT HEX(myColumn) FROM myTable WHERE ... to be very useful for this purpose.
Correct the encodings of those values that are currently incorrect: e.g. UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE ...—if an entire column is misencoded, you can instead use ALTER TABLE to change it to a binary string type and then back to a character string of the correct encoding. Beware of transformations that increase the encoded length, as the result might overflow your existing column size.

Character encoding MSSQL.. ISO -> Utf-8 -> Latin-1..need reversed

We are trying to migrate database content (with a PHP script).
Content has been copied into a CMS and then written to the database. Content copied could be from any character encoding scheme (e.g. IS0-...-14) and any website.
The PHP CMS is UTF-8 so the character pasted into a textbox would be converted to UTF-8 when it was POSTed but then written to the database as Latin-1 (MSSQL db...db charset and query charset both latin-1).
We are desperately trying to think up how this could be reversed or if it is even possible (to get it so the character is fully UTF-8) in PHP.
If we can get the logic we can write an extension in C++ if PHP cant handle it (which it probably cant, mb_shite and iconv).
I keep getting lost in UTF-8 4 byte character streams (i.e. 0-127 is..ect).
Anybody got any ideas?
So far we have used PHP's ord() function to try and produce a Unicode/Acsii char ref for each char (I know ord returns ASCII but it prints character numbers over 128 which I thought was wierd if it is just meant to be ASCII, or maybe it repeats itself).
My thoughts are the latin1 will struggle to convert back to UTF-8 and will result in black diamond due to single byte char stream in Latin1 (ISO-...-1).

If latin1 is an 8-bit clean encoding for your database (it is in MySQL, donno about MSSQL), then you don't need to do anything to reconstruct the utf-8 string. When you pull it out of your database into PHP you will get back the same bytes you put in, i.e. UTF-8.
If latin1 is not an 8-bit-clean encoding for your database then your strings are irretrievably broken. This means any characters which the database considered invalid were either dropped or replaced the moment you wrote your utf-8 string to the database. There isn't any way to recover from this.

MySQL outputs Western encoding in UTF-8 PHP file

I have the following problem: on a very simple php-mysqli query:
if ( $result = $mysqli->query( $sqlquery ) )
{
$res = $result->fetch_all();
$result->close();
}
I get strings wrongly encoded as Western encoded string, although the database, the table and the column is in utf8_general_ci collation. The php script itself is utf-8 encoded and the mysql-less parts of the script get the correct encodings. So say echo "ő" works perfectly, but echo $res[0] from the previous example outputs the EF BF BD character when the file viewed in the correct UTF-8 encoding. If I manually switch the browser's encoding to Western, the mysqli sourced strings get good decoding, except for the non-western characters being replaced with "?'.
What makes it even stranger is that on my development environment this isn't happening, while on my webserver it is. The developer environment is a LAMP stack (The Uniform Server), while the webserver uses nginx.
In this case, I entered the data in the database using phpMyAdmin, and inside phpmyadmin it displays perfectly. phpMyAdmin's collation is utf-8 too. I believe that the problem must be somewhere around here, as on the same webserver, for an other site where I enter data through php (using POST) the same problem doesn't happen. On that case, the data is visible correctly both while entering and while viewing it (I mean in the php generated webpages), but the special characters are not correct in phpMyAdmin.
Can you help me start where to debug? Is it connected to php or mysql or nginx or phpMyAdmin?

Use mysqli_set_charset to change the client encoding to UTF-8 just after you connect:
$mysqli->set_charset("utf8");
The client encoding is what MySql expects your input to be in (e.g. when you insert user-supplied text to a search query) and what it gives you the results in (so it has to match your output encoding in order for echo to display things correctly).
You need to have it match the encoding of your web page to account for the two scenarios above and the encoding of the PHP source file (so that the hardcoded parts of your queries are interpreted correctly).
Update: How to convert data inserted using latin-1 to utf-8
Regarding data that have already been inserted using the wrong connection encoding there is a convenient solution to fix the problem. For each column that contains this kind of data you need to do:
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET latin1;
ALTER TABLE table_name MODIFY column_name BLOB;
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET utf8;
The placeholders table_name, column_name and existing_column_type should be replaced with the correct values from your database each time.
What this does is
Tell MySql that it needs to store data in that column in latin1. This character set contains only a small subset of utf8 so in general this conversion involves data loss, but in this specific scenario the data was already interpreted as latin1 on input so there will be no side effects. However, MySql will internally convert the byte representation of your data to match what was originally sent from PHP.
Convert the column to a binary type (BLOB) that has no associated encoding information. At this point the column will contain raw bytes that are a proper utf8 character string.
Convert the column to its previous character type, telling MySql that the raw bytes should be considered to be in utf8 encoding.
WARNING: You can only use this indiscriminate approach if the column in question contains only incorrectly inserted data. Any data that has been correctly inserted will be truncated at the first occurrence of any non-ASCII character!
Therefore it's a good idea to do it right now, before the PHP side fix goes into effect.

Use mysqli::set_charset function.
$mysqli->set_charset('utf8'); //returns false if the encoding was not valid... won't happen
http://php.net/manual/en/mysqli.set-charset.php
I haven't used mysqli for some time, but if things are the same, connections by default use the latin swedish encoding (ISO 8859 1).
I will consider your page is already using utf8 encoding by having:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Inside the <head> tag.
If you have string already on latin swedish encoding, you can use mk_convert_encoding:
http://php.net/manual/en/function.mb-convert-encoding.php
$fixedStr = mb_convert_encoding($wrongStr, 'UTF-8', 'ISO-8859-1');
iconv does something very similar: Truth be told, I don't know the difference, but here's the link to the function reference:
http://php.net/manual/en/function.iconv.php
I just realized that you might have some strings in utf8 and others in latin swedish. You can use mb_detect_encoding for that: http://php.net/manual/en/function.mb-detect-encoding.php
You can also dump the database and use iconv (cmd line) if you have it installed:
iconv -f latain -t utf-8 < currentdb.sql > fixeddb.sql

PHP MySQL Chinese UTF-8 Issue

I have a MySQL table & fields that are all set to UTF-8. The thing is, a previous PHP script, which was in charge of the database writing, was using some other encoding, not sure whether it is in the script itself, the MySQL connection or somewhere else. The result is that although the table & fields are set to UTF-8, we see the wrong chars instead of Chinese.
It looks like that:
Now, the previous scripts (which were in charge of the writing and corrupted the data) can read it well for some reason, but my new script which all encoded in UTF-8, shows chars like ½©. How can that be fixed?

By the sound of it, you have a utf8 column but you are writing to it and reading from it using a latin1 connection, so what is actually being stored in the table is mis-encoded. Your problem is that when you read from the table using a utf8 connection, you see the data that's actually stored there, which is why it looks wrong. You can fix the mis-encoded data in the table by converting to latin1, then back to utf8 via the binary character set (three steps in total).

The original database was in a Chinese encoding – GB-18030 or similar, not Latin-1 – and the bytes that make up these characters, when displayed in UTF-8, show up as a bunch of Latin diacritics. Read each string as GB-18030, convert it to UTF-8, and save.

Don't fix whats not broken

I have made code that stores utf-8 in a database.
It shows it well in the browser but looks distorted in the database. Since the functionality seems to work and it doesn't look like I have had any problems with processing the string input, is it any point in 'fixing what is not broken' and make utf-8 characters like Japanese show in the database?
I don't search the database since the strings are serialized anyway.

You have to specify the text encoding of the queries, you are sending to MySQL with for instance
SET NAMES `utf8` COLLATE `utf8_unicode_ci`
If you don't, MySQL may interpret your query with the servers default text-encoding that can be different to UTF-8, e.g. iso-latin. So you will have strings in your tables, that are UTF-8 encoded, but MySQL marked them as iso-latin. That won't have much effect on your code, because MySQL just returns your UTF-8 strings back to you and you ignore the text-encoding. If you view the data in phpMyAdmin or any other application, that sets the connections character encoding, you will end up with distorted strings.
You could on the other hand utf8_decode your query strings and utf8_encode the result's provided by MySQL and don't change the connections text encoding from iso-latin. but if you query a different MySQL server that uses UTF-8 as default text encoding, you will end up with the same problem the other way around. so just set the connection's text encoding once after connecting.

What do you use to access the database. If you use a console just the the encoding in the console to utf-8. If you use GUI software just check the options the set the encoding to utf-8. You can try 'set names' to ser the client encoding.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.