Help with proper character encoding

Help with proper character encoding - php

I have a HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù
I have a PHP script that exports these form submissions into CSV format, when I look at the CSV format in a text editor (vim or notepad for example) the characters look fine, but when opened with Open Office or Word, I get some funky results: �����
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
What can I do to ensure portability of my CSV file? What's the proper way to handle the encoding?
My HTML file is content-type is set as: Content-Type: text/html; charset=utf-8
Data is being stored in MySQL as latin1_swedish_ci collation.

Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
If this does not match the table encoding, MySQL automatically converts data on the fly.
If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.

Also, what document type have you set, is it?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Try using the htmlentities() function for any text that is not showing correctly.
You may also want to have a look PHP Normalizer.

Make sure you are writing the CSV file as UTF-8. See http://www.php.net/manual/en/function.fwrite.php#55054 if you are unsure how to.
(Also, your sql table should be using utf8, not latin1)

It's up to you to decide which charset encoding you'll use for writing your CSV file (but, note, that must be a concious decision on your part).
Which charset encoding to use ? CSV does not defines a charset encoding - So I'd go for some Unicode charset, presumably UTF8. But some CSV consumers (eg Excel) might not be happy with it. If you are restricted to "western" langs, then latin1 or its variants (iso-8859-1 or iso-8859-15) might be more appropiate. But then (in any case, actually) you must think the conversion from user input to your particular encoding - and what to do if there are invalid characters.
(BTW: same consideration goes for the html-input-to-db conversion - you are using latin1 for your database, have you asked yourself what happens if the user types a non-latin1 character ? eg a japanese char ? ).

Related

Displaying some sort of apostrophe (Currently shows as diamond)

Some rows in my database contain an apostrophe of sorts, that, when displayed with PHP, are converted to diamonds with a question mark in the center. Example, if it copies correctly: Captain Jim O’Brien
These "apostrophes" were inserted most likely via TinyMCE, where the user was copying and pasting from Word, or something from a Mac computer perhaps.
How can I display these "apostrophes"? When I view the row in PHPMyAdmin, the apostrophes are displayed (no diamond), so there is obviously a way.
My character encoding is set to UTF-8, and I've tried htmlspecialchars($string) and htmlentities($string), with no luck.

Characters are encoded in different places.
MySQL has a particular character encoding. By default, it is not UTF-8 but rather latin1.
The HTML document you generate using PHP also has a particular character encoding specified. Finally, the actual bytes in the HTML document factually assume a particular character encoding, which if you're not careful can be different than the character encoding you specify for the document.
Verify that your MySQL encoding is set to UTF-8 as a first step. Note that MySQL can have the default character encoding for the database overridden on a per-table or even per-column basis.
You may be interested in this related post to get a deeper understanding of character encoding
Character Encoding and the â€™ Issue
Update
Something put the data into the MySQL database in the first place. Perhaps that "something" was not using UTF-8 encoding.

PHP charset accents issue

I have a form in my page for users to leave a comment.
I'm currently using this charset:
meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"
but retrieveving the comment from DB accents are not displaying correct ( Ex. è =>Ã¨ ).
Which parameters should i care about for a correct handling of accents?
SOLVED
changed meta tag to charset='utf-8'
changed character-set Mysql (ALTER TABLE comments CONVERT TO CHARACTER SET utf-8)
changed connection character-set both when inserting records and retrieving ($conn->query('SET NAMES utf8'))
Now accents are displaying correct
thanks
Luca

Character sets can be complicated and pain to debug when it comes to LAMP web applications. At each of the stages that one piece of software talks to another there's scope for incorrect charset translation or incorrect storage of data.
The places you need to look out for are:
- Between the browser and the web server (which you've listed already)
- Between PHP and the MySQL server
The character you've listed look like normal a European character that will be included in the ISO-8859-1 charset.
Things to check for:
even though you're specifying the character set in a meta header have a look in your browser to be sure which character set the browser is actually using. If you've specified it the browser should use that charset to render/view the page but in cases I've seen it attempting to auto-detect the correct charset and failing. Most browsers will have an "encoding" menu (perhaps under "view") that allows you to choose the charset. Ensure that it says ISO-8859-1 (Western European).
MySQL can happily support character set conversion if required to but in most cases you want to have your tables and client connection set to use the same encoding. When configured this way MySQL won't attempt to do any encoding conversion and will just write the data you input byte for byte into the table. When read it'll come out the same way byte for byte.
You've not said if you're reading data from the database back out with the same web-app or with some other client. I'd suggest you try to read it out with the same web application and using the same meta charset header (again, check the browser is really setting it) and see what is displayed in the browser.
To debug these issues requires you to be really sure about whether the client/console you're using is doing any conversion too, the safest way is sometimes to get the data into a hex editor where you can be sure that nothing else is messing around with any translation.
If it doesn't look like it's a browser-side problem please can you include the output of the following commands against your database:
Run from a connection that your web-app makes (not from some other MySQL client):
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
Run from any MySQL client:
SHOW CREATE TABLE myTable;
(where myTable is the table you're reading/writing data from/to)

The ISO-8859-1 character set is for Latin characters only. Try UTF-8, and make sure that the database these characters are coming from are also UTF-8 columns.

META value charset=UTF-8 prevents UTF-8 characters showing

I've made a test program that is basically just a textarea that I can enter characters into and when I click submit the characters are written to a MySQL test table (using PHP).
The test table is collation is UTF-8.
The script works fine if I want to write a é or ú to the database it writes fine. But then if I add the following meta statement to the <head> area of my page:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...the characters start becoming scrambled.
My theory is that the server is imposing some encoding that works well, but when I add the UTF-8 directive it overrides this server encoding and that this UTF-* encoding doesn't include the characters such as é and ú.
But I thought that UTF-8 encoded all (bar Klingon etc) characters.
Basically my program works but I want to know why when I add the directive it doesn't.
I think I'm missing something.
Any help/teaching most appreciated.
Thanks in advance.

Firstly, PHP generally doesn't handle the Unicode character set or UTF-8 character encoding. With the exception of (careful use of) mb_... functions, it just treats strings as binary data.
Secondly, you need to tell the MySQL client library what character set / encoding you're working with. The 'SET NAMES' SQL command does the job, and different MySQL clients (mysql, mysqli etc..) provide access to it in different ways, e.g. http://www.php.net/manual/en/mysqli.set-charset.php
Your browser, and MySQL client, are probably both defaulting to latin1, and coincidentally matching. MySQL then knows to convert the latin1 binary data into UTF-8. When you set the browser charset/encoding to UTF-8, the MySQL client is interpreting that UTF-8 data as latin1, and incorrectly transcoding it.
So the solution is to set the MySQL client to a charset matching the input to PHP from the browser.
Note also that table collation isn't the same as table character set - collation refers to how strings are compared and sorted. Confusing stuff, hope this helps!

How to make PHP use the right charset?

I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it to use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.

You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.
tell the database what encoding you're sending it bytes in, using mysql_set_charset().
Currently you are using EUC-KR in the database so presumably you want to use that encoding in both the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications, as if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can sneak through an SQL injection.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)

I don't know the charset, but if you are using HTML to show the results you should set the charset of the html
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
You can also use iconv (php function) to convert the charset to a different charset
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But i guess that in your case you will only have to change the meta tag.

Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, that in itself is neither right nor wrong nor anything else. The problem is when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8, that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.

Changing character encoding in MySQL, PHP scripts, HTML

So, I have built on this system for quite some time, and it is currently outputting Latin1 (ISO-8859-1) to the web browser, and this is the components:
MySQL - all data is stored with the Latin1 character set
PHP - All PHP text files are stored on disk with Latin1 encoding
HTML - The output has the http-equiv="content-type" content="text/html; charset=iso-8859-1" meta tag
So, I'm trying to understand how the encoding of the different parts come into play in my workflow. If I open a PHP script and change its encoding within the text editor to UTF-8 and save it back to disk and reload the web browser, the text is all messed up - unless the text comes from the DB. If I change the encoding of the DB to UTF-8 and keep the PHP files in latin1 I have to use utf8_decode() for the data to display correctly. And if I change the HTML code the browser will read it incorrectly.
So yeah, I realise that if I want to "upgrade" to UTF8, I have to update all three parts of this setup for it to work correctly, but since it's a huge system with some 180k lines of PHP code and millions of posts in a lot of databases/tables, I don't want to start something like this without understanding everything correctly.
What haven't I thought about? What could mess this up beyond fixing? What are the procedures for changing the encoding of an entire MySQL installation and what's the easiest way to change the encoding of hundreds or thousands of PHP files on disk?
The META tag is luckily added dynamically, so I'll change that in one place only :)
Let me hear about your experiences with this.

It's tricky.
You have to:
change the DB and every table character set/encoding – I don't know much about MySQL, but see here
set the client encoding to UTF-8 in PHP (SET NAMES UTF8) before the first query
change the meta tag and possible the Content-type header (note the Content-type header has precedence)
convert all the PHP files to UTF-8 w/out BOM – you can easily do that with a loop and iconv.
the trickiest of all: you have to change most of your string function calls. Than means mb_strlen instead of strlen, mb_substr instead of substr and $str[index], etc.

Don't convert to UTF8 if you don't have to. Its not worth the trouble.
UTF8 is (becoming) the new standard, so for new projects I can recommend it.
Functions
Certain function calls don't work anymore. For latin1 it's:
echo htmlentities($string);
For UTF8 it's:
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
strlen(), substr(), etc. Aren't aware of the multibyte characters.
MySQL
mysql_set_charset('UTF8') or mysql_query('SET NAMES UTF8') will convert all text to UTF8 coming from the database(SELECTs). It will also convert incoming strings(INSERT, UPDATE) from UTF8 to the encoding of the table.
So for reading from a latin1 table it's not necessary to convert the table encoding.
But certain characters are only available in unicode (like the snowman ☃, iPhone emoticons, etc) and can't be converted to latin1. (The data will be truncated)
Scripts
I try to prevent specials-characters in my php-scripts / templates.
I use the ë notation instead of ë etc. This way it doesn't matter if is saved in latin1 or utf8.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.