Having trouble getting foreign characters and Emoji to display.
Edited to clarify
A user types an emoji character into a text field, which is then sent to the server (PHP) and saved into the database (MySQL). When displaying the text we grab a JSON-encoded string from the server, which is parsed and displayed on the client side.
QUESTION: the character for a "trophy" emoji saved in the DB reads as
%uD83C%uDFC6
When that is sent back to the client we don't see the emoji picture; we see the raw encoded text instead.
How would we get the client side to read that text as an emoji character and display the image?
(All on an iPhone / Mobile Safari.)
Thanks!
Check the encodings used by your client, your web server, and your database table. Make sure they are all using encodings that can handle the characters you are concerned about.
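A quick way to see which encodings are actually in play is to ask MySQL itself. A minimal sketch, assuming a mysqli connection named $db:
// Lists character_set_client, character_set_connection,
// character_set_results, character_set_server, and friends.
$res = $db->query("SHOW VARIABLES LIKE 'character_set_%'");
while ($row = $res->fetch_row()) {
    echo $row[0], " = ", $row[1], "\n";
}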
Looks like the problem is my MySQL encoding... utf8mb4 would allow it; unfortunately it's unavailable before MySQL v5.5.
the character for a "trophy" emoji saved in the DB reads as %uD83C%uDFC6
Then your data are already mangled. %u escapes are specific to the JavaScript escape() function, which should generally never be used. Make sure your textarea-to-PHP handling uses standards-compliant encoding, e.g. encodeURIComponent if you need to get a JS variable into a URL query.
Then, having proper raw UTF-8 strings in your PHP layer, you can worry about getting MySQL to store characters like the emoji that are outside of the Basic Multilingual Plane. The best way is columns with the utf8mb4 character set; if that is not available, try binary columns, which will let you store any byte sequence (treating it as UTF-8 when it comes back out). That way, however, you won't get case-insensitive comparisons.
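A minimal sketch of the utf8mb4 route, assuming MySQL 5.5+ and a hypothetical posts table with a body column:
$db = new mysqli("localhost", "user", "pwd", "test");
// Make the connection itself speak utf8mb4 so 4-byte characters survive.
$db->set_charset("utf8mb4");
// Convert the column to utf8mb4 (requires MySQL 5.5 or later).
$db->query("ALTER TABLE posts MODIFY body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");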
Related
Why does PHP store characters such as Japanese in a MySQL table that supports utf8 as something else, but successfully read the value back out from MySQL as the original string?
E.g.
$db = new mysqli("localhost", "user", "pwd", "test");
$sql = "INSERT INTO testtable(name) VALUES ('ボーナスエリア')";
$db->query($sql);
From Workbench this has been inserted into the table as ãƒœãƒ¼ãƒŠã‚¹ã‚¨ãƒªã‚¢
I have no idea how or at what level that encoding/mapping happens.
Reading it back out in PHP results in the correct string ボーナスエリア being displayed on the webpage.
Why and how does that work?
UPDATE
Thanks for all the comments so far.
More than just being curious, this actually causes me a problem: I want to insert chars from another source, i.e. Java, which inserts CJK chars correctly through JDBC. PHP then has a problem reading them back out and displays them as ??????
Can anybody show exactly which encoding translates the characters given into what appears in the DB viewer?
UPDATE 2
My browser (which has nothing to do with this problem, as the value is ???? before it displays) is Firefox with the encoding set to Western ISO-8859-1. I can see Japanese characters displayed correctly next to ????? characters. Paradoxically, the characters that appear as ???? display correctly in the DB viewer.
[Screenshot: browser encoding settings]
[Screenshot: web page snippet]
PHP treats text mostly as arbitrary binary data. This means that in these cases it's quite common for two errors to cancel each other out.
For example, if you write ボーナスエリア in a source file and save it in UTF-8, what PHP sees are the bytes \xe3\x83\x9c\xe3\x83\xbc..., and that's what it will work with. You can pass that string to a database client library, as here to mysqli, and, if you are lucky, when you later get the text back from the database the client library will return the exact same bytes to PHP, independently of how the database actually stored the data.
What seems to be happening here is that the database client library is configured to interpret the data PHP hands to it as latin1, which means that it interprets the bytes \xe3\x83\x9c... as the characters ãƒœ..., and that's what the database will store. When you read the data back, the same thing happens in reverse: the client obtains the characters ãƒœ... from the database, and since it's set to encode them in latin1, it returns the bytes \xe3\x83\x9c... to PHP. This explains how you can have mojibake in the database while the PHP application still seems to work fine.
Of course, it would be better to have the database store the text in a readable format. For that you have to set the client encoding (see mysqli_set_charset) and the database column encoding (see the MySQL documentation) to utf8.
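A minimal sketch of the fix, reusing the hypothetical testtable from above:
$db = new mysqli("localhost", "user", "pwd", "test");
// Tell the client library that the bytes PHP sends and receives are UTF-8.
$db->set_charset("utf8");
// The column must also use a utf8 character set, e.g.:
//   ALTER TABLE testtable MODIFY name VARCHAR(255) CHARACTER SET utf8;
$db->query("INSERT INTO testtable(name) VALUES ('ボーナスエリア')");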
I have written code that stores UTF-8 in a database.
It shows up fine in the browser but looks distorted in the database. Since the functionality seems to work and I haven't had any problems processing the string input, is there any point in 'fixing what is not broken' and making UTF-8 characters like Japanese show correctly in the database?
I don't search the database since the strings are serialized anyway.
You have to specify the text encoding of the queries you are sending to MySQL, for instance with
SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'
If you don't, MySQL may interpret your query with the server's default text encoding, which can be different from UTF-8, e.g. ISO Latin-1. You will then have strings in your tables that are UTF-8 encoded but that MySQL has marked as ISO Latin-1. That won't have much effect on your own code, because MySQL just returns your UTF-8 strings back to you and you ignore the text encoding. But if you view the data in phpMyAdmin or any other application that sets the connection's character encoding, you will see distorted strings.
You could, on the other hand, utf8_decode your query strings, utf8_encode the results provided by MySQL, and leave the connection's text encoding at ISO Latin-1. But if you then query a different MySQL server that uses UTF-8 as its default text encoding, you will end up with the same problem the other way around. So just set the connection's text encoding once, right after connecting.
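A sketch of doing that once at connect time; the DSN values are placeholders, and the charset DSN option assumes PHP 5.3.6+:
// Set the connection encoding once, at connect time, via the PDO DSN:
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'pwd');
// The mysqli equivalent:
// $db = new mysqli('localhost', 'user', 'pwd', 'test');
// $db->set_charset('utf8');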
What do you use to access the database? If you use a console, just set the encoding in the console to UTF-8. If you use GUI software, check the options to set the encoding to UTF-8. You can also try SET NAMES to set the client encoding.
Hey, guys. I work for http://pastebin.com and we have a little issue with the new API and char encoding.
On the site itself we use a meta tag which specifies that everything on the site, including the forms, is UTF-8. Because of this all chars get stored the right way, without having to modify any char types.
With the API, however, people can send data from all kinds of different sources and forms, so it has to be checked and possibly converted before being stored.
Chars that are giving a problem are for example:
고객님이 티빙
Iñtërnâtiônàlizætiøn
♥♥♥♥♥
идите в *оопу, он лучший)
What would be a good way to approach this data input to the API to make sure all chars get stored in a valid UTF-8 format that will work on our site?
Assuming your client is sending UTF-8 data and headers correctly: it sounds like you're doing a utf8_encode() on already-encoded UTF-8 data.
Duplicate: What is the best way to handle uploaded text files of different encodings?
In a nutshell, the only reliable way is having the client specify what encoding they are using. Automatic encoding detection is imperfect and tends to be unreliable.
You could for example specify that incoming data needs an encoding specified if it's not UTF-8.
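A sketch of that approach; the parameter names are hypothetical:
// Accept an optional "encoding" API parameter; default to UTF-8.
$declared = isset($_POST['encoding']) ? $_POST['encoding'] : 'UTF-8';
$text = $_POST['paste_code'];
// Reject input that is not valid in the declared encoding...
if (!mb_check_encoding($text, $declared)) {
    die("Input is not valid $declared");
}
// ...then normalize everything to UTF-8 for storage.
$utf8 = mb_convert_encoding($text, 'UTF-8', $declared);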
I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.
You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.
tell the database what encoding you're sending it bytes in, using mysql_set_charset().
Currently you are using EUC-KR in the database, so presumably you want to use that encoding in both of the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (There are potential security implications too: if mysql_real_escape_string doesn't know the correct connection encoding, a multibyte sequence containing ' or \ can slip through the escaping and open you up to SQL injection.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)
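A sketch of all three steps; the service URL is a placeholder, since the actual endpoint isn't known from the question:
// 1. Tell the browser the page and its forms are UTF-8.
header('Content-Type: text/html; charset=utf-8');
// 2. Tell MySQL the connection uses UTF-8 (mysqli equivalent of mysql_set_charset).
$db = new mysqli('localhost', 'user', 'pwd', 'kssn');
$db->set_charset('utf8');
// 3. Transcode the name to EUC-KR only for the external check service.
$serviceUrl = 'http://enpang.com/namecheck'; // placeholder; actual endpoint unknown
$name = iconv('UTF-8', 'EUC-KR//TRANSLIT', $_POST['name']);
$result = file_get_contents($serviceUrl . '?Name=' . urlencode($name));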
I don't know the charset, but if you are using HTML to show the results you should set the charset of the HTML (EUC-KR here, to match the Korean data):
<META http-equiv="Content-Type" content="text/html; charset=EUC-KR">
You can also use iconv (a PHP function) to convert from one charset to another:
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But I guess that in your case you will only have to change the meta tag.
Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, which in itself is neither right nor wrong nor anything else. The problem arises when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8; that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.
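A sketch of that conversion; the URL is a placeholder, and it leans on the $http_response_header variable that file_get_contents populates for HTTP URLs:
$text = file_get_contents('http://example.com/check'); // placeholder URL
$charset = 'EUC-KR'; // the fallback guess if no header says otherwise
foreach ($http_response_header as $header) {
    if (preg_match('/charset=([\w-]+)/i', $header, $m)) {
        $charset = $m[1]; // trust the Content-Type header when present
    }
}
$utf8 = mb_convert_encoding($text, 'UTF-8', $charset);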
I have a HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù
I have a PHP script that exports these form submissions into CSV format. When I look at the CSV in a text editor (vim or Notepad, for example) the characters look fine, but when it is opened with OpenOffice or Word, I get some funky results: �����
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
What can I do to ensure portability of my CSV file? What's the proper way to handle the encoding?
My HTML file's Content-Type is set as: Content-Type: text/html; charset=utf-8
Data is being stored in MySQL with the latin1_swedish_ci collation.
Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store characters outside that character set in your table.
Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
If this does not match the table encoding, MySQL automatically converts data on the fly.
If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
The page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
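A sketch of that last step; the result set and column names are hypothetical:
// Rows come back UTF-8 because of SET NAMES "utf8"; convert each field
// to Latin-1 before writing so that Office's "ANSI" guess is correct.
$fh = fopen('export.csv', 'w');
while ($row = $result->fetch_assoc()) {
    fputcsv($fh, array_map('utf8_decode', $row));
}
fclose($fh);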
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.
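A one-line sketch, with a hypothetical variable name:
// Turn entities like &Atilde; back into literal characters before
// handing the text to an XML consumer such as the Salesforce API.
$xmlSafe = html_entity_decode($text, ENT_QUOTES, 'UTF-8');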
Also, what document type have you set? Is it:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Try using the htmlentities() function for any text that is not showing correctly.
You may also want to have a look at PHP's Normalizer class.
Make sure you are writing the CSV file as UTF-8. See http://www.php.net/manual/en/function.fwrite.php#55054 if you are unsure how to.
(Also, your SQL table should be using utf8, not latin1.)
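If that linked comment is the usual byte-order-mark trick, the idea is to write a UTF-8 BOM before the data so spreadsheet applications detect the encoding; a sketch:
// Prepend a UTF-8 BOM so Excel/OpenOffice recognise the file as UTF-8.
$fh = fopen('export.csv', 'w');
fwrite($fh, "\xEF\xBB\xBF");
fputcsv($fh, array('à', 'è', 'ì', 'ò', 'ù'));
fclose($fh);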
It's up to you to decide which charset encoding you'll use for writing your CSV file (but note that it must be a conscious decision on your part).
Which charset encoding to use? CSV does not define a charset encoding, so I'd go for some Unicode encoding, presumably UTF-8. But some CSV consumers (e.g. Excel) might not be happy with it. If you are restricted to "Western" languages, then Latin-1 or one of its variants (ISO-8859-1 or ISO-8859-15) might be more appropriate. But then (in any case, actually) you must think about the conversion from user input to your particular encoding, and about what to do if there are invalid characters.
(BTW: the same consideration applies to the HTML-input-to-DB conversion. You are using latin1 for your database; have you asked yourself what happens if the user types a non-latin1 character, e.g. a Japanese one?)
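A sketch of handling that, assuming you settle on UTF-8 internally; the field name is hypothetical:
// Validate incoming form text; coerce anything that is not valid UTF-8
// before it reaches the database.
$input = $_POST['field'];
if (!mb_check_encoding($input, 'UTF-8')) {
    // Assumption: broken input is most likely Latin-1; convert it.
    $input = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}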