We get a daily upload of a CSV file from a client that they say is in UTF-16LE encoding. However, when I run iconv('UTF-16LE', 'UTF-8', $line) on each line of the CSV file, it looks like this when going into the database:
Z�A�A�0�7�3�7
I.e., there's one of those [?] replacement characters in between every character.
I tried utf8_encode and various combinations of iconv and different encoding types in order to get this to go away. Has anyone had any experience with this and how to convert an unknown or unsupported encoding into UTF8, or at least something readable by PHP and MySQL?
Both UTF-16 and UTF-8 can represent every Unicode character, so a valid UTF-16 string always converts losslessly to UTF-8; characters outside the Basic Multilingual Plane simply take four bytes in either encoding.
UTF-16 data often carries a byte-order mark that says whether it's LE or BE. Just for fun, you could try converting from UTF-16 (no '-LE'): iconv will honor the BOM if one is present, which would tell you whether your client was wrong about LE. Given the null byte between every character in your output, though, the more likely problem is that the conversion never actually ran, or that the file is being split into lines before it is converted.
If you can't pin down the encoding at all, one solution would be to store it as byte arrays (BINARY(x)) in the database, not as text.
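A minimal sketch of one likely fix, assuming the file really is UTF-16LE (the filename and helper name below are placeholders): reading a UTF-16 file line by line splits on the single byte 0x0A in the middle of the 2-byte code units, which produces exactly the interleaved-null symptom shown above, so convert the whole file first and split afterwards.

```php
<?php
// Sketch: convert the whole UTF-16LE file to UTF-8 *before* splitting it
// into lines. fgets() splits on the single byte 0x0A, which lands in the
// middle of UTF-16's 2-byte code units and leaves stray null bytes.
function utf16le_to_utf8(string $raw): string
{
    // Strip a UTF-16LE byte-order mark (FF FE) if the client sends one.
    if (substr($raw, 0, 2) === "\xFF\xFE") {
        $raw = substr($raw, 2);
    }
    // Note the names iconv actually accepts: 'UTF-16LE', not 'UTF16-LE'.
    return iconv('UTF-16LE', 'UTF-8', $raw);
}

// Usage ('daily.csv' is a placeholder filename):
// $lines = preg_split('/\r\n|\n/', utf16le_to_utf8(file_get_contents('daily.csv')));
```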
Related
My PHP form is submitting special latin characters as symbols.
So, Québec turns into Québec
My form is set to UTF-8 and my database table has latin1_swedish_ci collation.
PHP: $db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');
A bindParam: $sql->bindParam(":x", $_POST['x'], PDO::PARAM_STR);
I am new to PDO so I am not sure what the problem is. Thank you
*I am using phpMyAdmin
To expand a little bit more on the encoding problem...
Any time you see one character in a source turn into two (or more) characters, you should immediately suspect an encoding issue, especially if UTF-8 is involved. Here's why. (I apologize if you already know some of this, but I hope to help some future SO'ers as well.)
All characters are stored in your computer not as characters, but as bytes. Back in the olden days, space and transmission time were much more limited than now, so people tried to save every byte possible, even down to not using a full byte to store a character. Now, because we realize that we need to communicate with the whole world, we've decided it's more important to be able to represent every character in every language. That transition hasn't always been smooth, and that's what you're running up against.
Latin-1 (in various flavors) is an encoding that always uses a single 8-bit byte for a character. Which means it can only have 256 possible characters. Plenty if you only want to write English or Swedish, but not enough to add Russian and Chinese. (background on Latin-1)
UTF-8 encodes the first half of Latin-1 in exactly the same way, which is why you see most of the characters looking the same. But it doesn't always use a single byte for a character -- it can use up to four bytes on one character. (utf-8) As you discovered, it uses 2 bytes for é. But Latin-1 doesn't know that, and is doing its best to display those two bytes.
The trick is to always specify your encoding for byte streams (like info from a file, a URL, or a database), and to make sure that encoding is correct. (Sometimes that's a pain to find out, for sure.) Most modern languages, like Java and PHP, do a good job of handling all the translation issues between different encodings, as long as you've correctly specified what you're dealing with.
You've pretty much answered your own question: you're receiving UTF-8 from the form but trying to store it in a Latin-1 column. You can either change the encoding on the column in MySQL or use the iconv function to translate between the two encodings.
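A sketch of the iconv route this answer mentions (the helper names are illustrative): transcode the UTF-8 form input before it reaches the latin1_swedish_ci column, and back again on the way out. //TRANSLIT asks iconv to approximate characters Latin-1 can't represent instead of failing outright.

```php
<?php
// Convert UTF-8 form input for storage in a Latin-1 column, and back.
function utf8_to_latin1(string $s): string
{
    return iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $s);
}

function latin1_to_utf8(string $s): string
{
    return iconv('ISO-8859-1', 'UTF-8', $s);
}
```

Changing the column itself to utf8 (or utf8mb4) is usually the cleaner fix, since it avoids transcoding on every read and write.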
Change your database table and column to the utf8 character set (utf8_unicode_ci collation), or better yet utf8mb4.
Make sure you are saving the file with UTF-8 encoding (this is often overlooked)
Set headers:
<?php header("Content-type: text/html; charset=utf-8"); ?>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
We are trying to migrate database content (with a PHP script).
Content has been copied into a CMS and then written to the database. The copied content could be from any character encoding scheme (e.g. ISO-...-14) and any website.
The PHP CMS is UTF-8 so the character pasted into a textbox would be converted to UTF-8 when it was POSTed but then written to the database as Latin-1 (MSSQL db...db charset and query charset both latin-1).
We are desperately trying to think up how this could be reversed or if it is even possible (to get it so the character is fully UTF-8) in PHP.
If we can get the logic right we can write an extension in C++ if PHP can't handle it (which it probably can't, mb_shite and iconv).
I keep getting lost in UTF-8 4-byte character streams (i.e. 0-127 is single-byte, etc.).
Anybody got any ideas?
So far we have used PHP's ord() function to try and produce a Unicode/ASCII char ref for each char (I know ord returns ASCII, but it prints character numbers over 128, which I thought was weird if it is just meant to be ASCII, or maybe it repeats itself).
My thought is that the latin1 will struggle to convert back to UTF-8 and will result in black diamonds, due to the single-byte char stream in Latin-1 (ISO-...-1).
If latin1 is an 8-bit-clean encoding for your database (it is in MySQL; I don't know about MSSQL), then you don't need to do anything to reconstruct the UTF-8 string. When you pull it out of your database into PHP you will get back the same bytes you put in, i.e. UTF-8.
If latin1 is not an 8-bit-clean encoding for your database then your strings are irretrievably broken. This means any characters which the database considered invalid were either dropped or replaced the moment you wrote your utf-8 string to the database. There isn't any way to recover from this.
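A quick way to test this answer's premise, sketched with mb_check_encoding (the function and column names are hypothetical): if the database layer is 8-bit clean, the bytes you read back should still validate as UTF-8.

```php
<?php
// If the database round-trips bytes untouched, fetched values should still
// be valid UTF-8; a failure here means something mangled them on the way.
function looks_like_intact_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// e.g. after fetching a row: looks_like_intact_utf8($row['body'])
```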
I have a MySQL table & fields that are all set to UTF-8. The thing is, a previous PHP script, which was in charge of the database writing, was using some other encoding, not sure whether it is in the script itself, the MySQL connection or somewhere else. The result is that although the table & fields are set to UTF-8, we see the wrong chars instead of Chinese.
Now, the previous scripts (which were in charge of the writing and corrupted the data) can read it well for some reason, but my new script, which is all UTF-8, shows chars like ½©. How can that be fixed?
By the sound of it, you have a utf8 column but you are writing to it and reading from it using a latin1 connection, so what is actually being stored in the table is mis-encoded. Your problem is that when you read from the table using a utf8 connection, you see the data that's actually stored there, which is why it looks wrong. You can fix the mis-encoded data in the table by converting to latin1, then back to utf8 via the binary character set (three steps in total).
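To make the mechanics concrete, here's a sketch of what the latin1 connection does to the bytes and how to reverse it in PHP (function name and the example character are illustrative; the ALTER statements use hypothetical table/column names):

```php
<?php
// What a latin1 connection does to 'é' (UTF-8 bytes C3 A9): the server
// treats each byte as a Latin-1 character and re-encodes it as UTF-8,
// storing C3 83 C2 A9 - which a utf8 connection then displays as 'Ã©'.
// Undoing the extra layer means encoding back to Latin-1, which restores
// the original UTF-8 bytes:
function undo_double_encoding(string $mojibake): string
{
    return mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');
}

// The in-database equivalent is the three-step conversion via the binary
// character set (adjust type and size to your column):
//   ALTER TABLE t MODIFY col VARCHAR(255) CHARACTER SET latin1;
//   ALTER TABLE t MODIFY col VARBINARY(255);
//   ALTER TABLE t MODIFY col VARCHAR(255) CHARACTER SET utf8mb4;
```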
The original database was in a Chinese encoding – GB-18030 or similar, not Latin-1 – and the bytes that make up these characters, when displayed in UTF-8, show up as a bunch of Latin diacritics. Read each string as GB-18030, convert it to UTF-8, and save.
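A sketch of this answer's suggestion (GB18030 is this answer's guess at the source encoding; the real data may be GBK, Big5, or something else, so treat the encoding name as an assumption to verify):

```php
<?php
// Reinterpret the stored bytes as a Chinese encoding and convert to UTF-8.
function gb18030_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'GB18030');
}
```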
So, I currently have this problem - I have a SQL db dump and the character encoding in it is latin1, but there are some UTF-8 chars in the file that look like Ä (should be ā), Ä« (should be ī), Å¡ (should be š), Ä“ (should be ē), etc. How do I convert these letters back to the original UTF-8?
Character in the file <-> what it should have been <-> bytes
Ä“ <-> ē <-> 5
Ä <-> ā <-> 2
Å¡ <-> š <-> 4
Ä« <-> ī <-> 4
If you're seeing multiple bytes for what should be single characters, chances are it's already in UTF-8. Bear in mind that ISO-8859-1 is a single-byte-per-character encoding, whereas UTF-8 can take multiple bytes - and any non-ASCII character does take multiple bytes.
I suggest you open the file in a UTF-8-aware text editor, and check it there.
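The samples in the question (Ä“ for ē, Å¡ for š) look like UTF-8 bytes that were decoded as Windows-1252 and re-encoded as UTF-8. If that's what happened, reversing it means encoding each string back to CP1252, which restores the original UTF-8 bytes - a sketch (the function name is illustrative; Windows-1252 rather than ISO-8859-1 is needed because “ (U+201C) only exists in CP1252):

```php
<?php
// Reverse "UTF-8 read as Windows-1252 and saved as UTF-8 again":
// encoding back to CP1252 yields the original UTF-8 byte sequence.
function fix_cp1252_mojibake(string $s): string
{
    return mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
}
```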
Encoding should be set on the connection on which you import data and read out data. If both of them are set to UTF-8, you will face no problems.
If you however import them with a latin1 connection, and later on reading it out with a UTF-8, you're in a world of trouble.
PHP strings are just byte arrays, and PHP's built-in string functions assume single-byte, latin1-style data; however, that isn't necessarily a problem for you.
If you have already wrongly imported the data, you would see a lot of ? or (diamond + ?) on your output I think.
But basically, when connecting from PHP, make sure to invoke SET NAMES 'utf8' as the first thing you do and see if that works.
If data still is wrong, you could use PHPs functions utf8_encode / utf8_decode to convert the data that is problematic.
In a working scenario they should never be used though.
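The connection-charset advice above, sketched in PDO terms (host, database name, and credentials are placeholders). With a modern PHP, putting charset= in the DSN achieves the same thing as issuing SET NAMES yourself.

```php
<?php
// Build a MySQL PDO DSN that fixes the connection charset up front.
function mysql_dsn(string $host, string $db, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$db};charset={$charset}";
}

// $pdo = new PDO(mysql_dsn('localhost', 'x'), 'x', 'x');
// Or, on an existing connection: $pdo->exec("SET NAMES 'utf8'");
```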
I'm migrating a db from mysql to postgresql. The mysql db's default collation is UTF8, postgres is also using UTF8, and I'm encoding the data with pg_escape_string(). For whatever reason however, I'm running into some funky errors about bad encoding:
pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
I've been poking around trying to figure this out, and noticed that PHP is doing something weird; if a string has only ASCII chars in it (e.g. "hello"), the encoding is reported as ASCII. If the string contains any non-ASCII chars, it says the encoding is UTF8 (e.g. "Hëllo").
When I use utf8_encode() on strings that are already UTF8, it kills the special chars and makes them all messed up, so.. what can I do to get this to work?
(the exact char hanging it up right now is "�", but instead of just search/replace, i'd like to find a better solution so this kinda problem doesn't happen again)
Most likely, the data in your MySQL database isn't UTF8. It's a pretty common scenario. MySQL at least used to not do any proper validation at all on the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not, I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load it, and thus it may fail.
You may want to feed the data through something like iconv that can be set to ignore unknown characters, or transform them to "best guess".
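Two hedged sketches of what that could look like in PHP (helper names are illustrative). Since the rejected byte 0xEB is 'ë' in Latin-1, the data is quite possibly Latin-1, and converting it properly is the lossless fix; failing that, scrubbing invalid sequences at least lets the load proceed.

```php
<?php
// Lossless route, assuming the source data is actually Latin-1
// (0xEB is 'ë' there, matching the byte in the error message):
function latin1_to_utf8_for_pg(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Lossy fallback: re-coding UTF-8 to UTF-8 makes mbstring replace any
// invalid sequences (with '?' by default), so Postgres will accept it.
function scrub_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'UTF-8');
}
```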
BTW, an ASCII string is exactly the same in UTF-8 because they share the same first 128 characters; so "Hello" in ASCII is byte-for-byte the same as "Hello" in UTF-8 - there's no conversion needed.
The collation in the table may be UTF-8, but you may not be fetching information from it in the same encoding. If you have trouble with information you give to pg_escape_string, it's probably because you're assuming content fetched from MySQL is encoded in UTF-8 while it's not. I suggest you look at the MySQL documentation page on connection character sets and check the encoding of your connection; you're probably fetching from a table where the collation is UTF-8, but your connection is something like Latin-1 (where special characters such as çéèêöà etc. won't be encoded in UTF-8).