Preparing PHP application to use with UTF-8

Preparing PHP application to use with UTF-8 - php

UTF-8 is de facto standard for web applications now, but PHP this is not a default encoding for PHP (until 6.0). Most of the server is set up for the ISO-8859-1 encoding by default.
How to overload the default settings in the .htaccess to be sure that everything goes well for UTF-8, locale etc.? Any options for the web server, Unix OS?
Is there any comprehensive list of those settings? E.g. mbstring options, iconv settings, locale etc I should set up for each multi language project? Any pre defined .htaccess as an example?
(In my particular case I need setup for the languages: English, Dutch and Russian. The server is in Ukraine).

Some useful options to have in .htaccess:
########################################
# Locale settings
########################################
# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"
SetEnv LC_ALL nl_NL.UTF-8
########################################
# Set up UTF-8 encoding
########################################
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
php_value default_charset "UTF-8"
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"
php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
php_value mbstring.encoding_translation On
php_value mbstring.func_overload 6
# See also php functions:
# mysql_set_charset
# mysql_client_encoding
# database settings
#CREATE DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#
#ALTER DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#ALTER TABLE tbl_name
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# ;

You're right UTF-8 is a good choice for webapplications.
Encoding is meta-information to the data that get's processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost, if you don't know the encoding. I often call this a chain, if the encoding-chain is broken, the data will be broken. This is both true for displaying data as well as for security.
As a rule of thumb, PHP is binary, it's the context/you who specifies the encoding (e.g. how you save your php source-code files).
So let's tackle a short (and incomplete) list:
The OS
Environment variables might tell you about the locale in use and the encoding. File-systems do have their encoding for names of files and directories for example. I'm not very firm to this subject, normally we try to name our files in english so to use only characters in the range of US-ASCII which is safe for the Latin extended charsets like ISO-8859-1 in your case as well as for UTF-8.
Just keep this in mind when you save files your users upload: Just filter filenames to basic letters and punctation and you'll have nearly no hassles (a-z, A-Z, 0-9, ., -, _), even make them all lowercase for visual purposes.
If you feel that this degrades usability and the file-system does not offer the unicode range of characters as of UTF-8, you can fallback to simple encodings like rawurlencode (Percent-Encoding, triplet) and offer files to download by resolving that name to disk.
Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
HTML
This is merely independent to PHP, it's about the output your scripts provide so the field of work.
Handling character encodings in HTML and CSS
Rule of thumb is: Specify it. If you didn't specifiy it (HTML files, CSS files, Javascript files) don't expect it to work precisely. Just do it then. Encoding is a chain, if there are many components, ensure that each knows about it's encoding. Otherwise browsers can only guess. UTF-8 is a good choice so, but our job is to take care and make this precise and well defined.
PHP Settings
As a general rule of thumb, start reading the php.ini file that ships with the PHP package of your linux distro. It comes with readable documentation in it's comments and further links. Some settings that come to my mind:
default_charset - PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty (Source). For general information see Setting the HTTP charset parameterW3C. If you want to improve your site's output, e.g. for preserving the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well <meta http-equiv="Content-type" content="text/html;charset=UTF-8">.
output_handler - This setting is worth to look at as it is specifying the output handler (Output Buffering ControlDocs) and each handler (mb, iconv) can have it's own encoding settings (see Strings).
Strings
StringsDocs - By default strings in PHP are binary. As long as you use them with binary safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings. That's for forward compatibility of the said PHP 6 unicode support: $binary = (binary) $string; or $binary = b"binary string";.
mb_internal_encoding()Docs - Gain or set it; mbstring.internal_encodingINI. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
iconv_set_encoding()Docs - Comparable for the iconv extension. See as well the iconv configuration settings.
Various: Some functions that deal with character sequences allow you to specify a charset encoding. For example htmlspecialcharsDocs. Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 but you're looking for UTF-8. Other functions like html_entity_decodeDocs are using UTF-8 per default. Some like htmlspecialchars_decode do not specify a charset at all, so you need to read the PHP source-code for a concrete specific understanding of how the function deals with the (binary) string.
To answer your question: The need of settings and parameters always depend on the components you use. For the general ones like the browser or the webserver, it's possible to give recommendation settings to get it configured for UTF-8. But with everything else it depends. The most important thing is to look for it and to ensure that you know the encoding and can configure/specify it. Often it's documented. As long as you don't need to deal with portable code, this is much simpler as you have control of the environment or you need to deal with a specific environment only. Write code defensively with encoding in mind and you should be fine.

All your files have to be saved in UTF-8 (without BOM) using your code editor.
Webserver may be configured to send inappropriate headers, so it's recommended to override them in application level. For instance:
header('Content-Type: text/html; charset=utf-8');
Add HTML meta content-type:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Use htmlspecialchars() instead of htmlentities() because the former is enough in utf-8 and the latter is incompatible with utf-8 by default.
Tend not to use PHP standard string functions because many of them are incompatible with utf-8. Try to find their counterparts in Multibyte String or other libraries. (Don't forget to set default charset for the library before using it because the library supports many encodings and utf-8 is just one of them.)
For regular expressions use u modifier. For example:
preg_match('/ž{3,5}/u', $string, $matches);
Together this is the most reliable way to check if the given string is valid utf-8 string:
if (#preg_match('//u', $string) === false) {
// NOT valid!
} else {
// Valid!
}
If you use the database then always set appropriate connection encoding right after the connection is made. Example for MySQL:
mysql_set_charset('utf8', $link);
Also check if columns in the database are in utf-8. It's not always needed but recomended.

Basically I do three things to work correctly with czech language:
1) define locale in PHP:
setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");
so you would use something like:
setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");
based on language which is currently switched to.
2) define charset for the database:
mysql_query("set names latin2 collate latin2_czech_cs");
3) define the charset of PHP/HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
I don't use any .htaccess setting. You can modify this for your case, in locale use something like en_US.utf8 (based on language currently which is currently switched to), in charset use utf-8 instead of latin2/iso-8859-2 and it should work well.

Try one of the following:
AddDefaultCharset UTF-8
AddCharset UTF-8 .php

Related

Migrating a php application to handle UTF-8

I am working on a multi-language app in php.
All was fine until recently I was asked to support Chinese characters. The actions I took to support UTF-8 characters are the following:
All DB tables are now UTF-8
HTML templates contain the tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The controllers send out a header specifying the encoding (utf-8) to use for the http response
All was good until I started making some string manipulations (substr and the likes)
With chinese it won't work because the chinese is represented as multibytes and hence if you do a normal substring (substr) it will prolly cut a "letter" in the middle of one of the bytes allocated and f*ck up the result on screen.
I fixed ALL my problems by adding this in the bootstrap
mb_internal_encoding("UTF-8");
and replacing all the strlen, substr, strstr with their mb_ counterparts.
What other things do I need to do to support UTF-8 fully in php?

There's a little more to it than just replacing those functions.
Regular expressions
You should add the utf8 flag to all of your PCRE regular expressions that can have strings which contain non-Ascii chars, so that the patterns are interpreted as the actual characters rather than bytes.
$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; //The u flag indicates the pattern is UTF8
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
Also you should use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets?
\p{L} instead of \w for any 'letter' character.
\p{Z} instead of \s for any 'space' character.
\p{N} instead of \d for any 'digit' character e.g. Arabic numbers
There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.
Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.
Function replacements
Although your list is a start, the list of function I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacement functions, some of which are not defined in PHP, but are available from here on Github as mb_extra.
$unsafeFunctions = array(
'mail' => 'mb_send_mail',
'split' => null, //'mb_split', deprecated function - just don't use it
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr',
'strlen' => 'mb_strlen',
'strpos' => 'mb_strpos',
'strrpos' => 'mb_strrpos',
'strrchr' => 'mb_strrchr',
'strripos' => 'mb_strripos',
'strstr' => 'mb_strstr',
'strtolower' => 'mb_strtolower',
'strtoupper' => 'mb_strtoupper',
'substr_count' => 'mb_substr_count',
'substr' => 'mb_substr',
'str_ireplace' => null,
'str_split' => 'mb_str_split', //TODO - check this works
'strcasecmp' => 'mb_strcasecmp', //TODO - check this works
'strcspn' => null, //TODO - implement alternative
'strrev' => 'mb_strrev', //TODO - check this works
'strspn' => null, //TODO - implement alternative
'substr_replace'=> 'mb_substr_replace',
'lcfirst' => null,
'ucfirst' => 'mb_ucfirst',
'ucwords' => 'mb_ucwords',
'wordwrap' => null,
);
MySQL
Although you would have thought that setting the character type to utf8 would give you UTF-8 support in MySQL, it does not.
It only gives you support for UTF-8 that are encoded in up to 3 bytes aka the Basic Multi-lingual Plane. However people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, also know as the Supplementary Multilingual Plane
To support these you should in general use:
utf8mb4 - for your character encoding.
utf8mb4_unicode_ci - for your character collation.
For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.
The list of places where you should set the character set and collation in your MySQL config file are:
[mysql]
default-character-set=utf8mb4
[client]
default-character-set=utf8mb4
[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
The SET NAMES may not be required in all circumstances - but it is safer on at only a small speed penalty.
PHP INI File
Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:
mbstring.language = Neutral ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set dectection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header
Helping browser to choose UTF8 for forms
You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.
Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.
Misc
If you're using Apache set "AddDefaultCharset utf-8"
As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.
That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.

How PHP knows the encoding of the .php files?

How does PHP know the encoding of the .php-files it interprets?
I mean the .php-files could be encoded in e.g. UTF-8 or CP 1252. And this would affect e.g. string literals.
Is there one setting in the php.ini? Or does PHP try to determine the encoding automatically (e.g. assume CP 1252 if no valid UTF-8 ...)?
Thanks for your explanation!

PHP source code makes no assumption about the source encoding. Everything is treated as binary. This means that if your editor saves a file as CP-1252 (I sure hope not), the strings you echo are also CP-1252.

The encoding of a file has very little to do with string literals in it. Strings are just a sequence of bytes as far as PHP is concerned, no further data is stored. If you include utf-8 strings in a iso-8859-15 file, it will still be the bytes of an utf-8 string. As these are just bytes, you are free to mix different encodings in strings in the same file (although they would look weird in any editor).
You are probably not looking to define an encoding of a file, but just how to handle & output strings. You can define what it outputs as a header (which is most likely what you want) with the default_charset ini-setting, and internal mb_ functions listen to mbstring.internal_encoding.
Note that zend.multibyte should be able to actually scan files in a different encoding which are not compatible with the normal scanner (for instance CP936, Big5, CP949, Shift_JIS), which you can configure in ini settings & help with a declare(encoding='name'), but I very much doubt this is what you are looking for. I have yet to test that functionality, and the documentation of it is next no non-existent.

What do these PHP mbstring settings do?

I'm trying to figure out exactly what these php.ini settings do. What happens when they're set to different values? When are they necessary? When are they harmful?
mbstring.language
mbstring.http_input
mbstring.http_output
mbstring.encoding_translation
As usual, the PHP manual is less than helpful.
EDIT: Just to clarify, I understand how character encodings work, and I understand how PHP's multi-byte functions differ from their single-byte counterparts. I'm looking for specifics on what the above settings do.
EDIT 2: OK, it looks like they actually do provide more documentation than just the page on runtime configuration, which just has one-line summaries. The first three of these have similarly-named functions, and there are more details on the pages that describe the function versions. I added links above.
EDIT 3: Adding a bounty. I'm looking for specific details on exactly what these settings do, especially the last three. What do they convert from and to, and when do they do it?

You can change mbstring.language to whatever language you are using with. (Source)
language
; language for internal character representation.
mbstring.language = Neutral ; Set default language to neutral(UTF-8) (default)
mbstring.language = English
mbstring.language = Japanese
mbstring.language = Korean ;For Korean market later
http_input
; http input encoding.
mbstring.http_input = pass
mbstring.http_input = auto
mbstring.http_input = UTF-8
mbstring.http_input = UTF-8, SJIS, EUC-JP
http_output
; http output encoding. mb_output_handler must be
; registered as output buffer to function
mbstring.http_output = pass
mbstring.http_output = UTF-8
encoding translation
; enable automatic encoding translation accoding to
; mbstring.internal_encoding setting. Input chars are
; converted to internal encoding by setting this to On.
; Note: Do _not_ use automatic encoding translation for
; portable libs/applications.
mbstring.encoding_translation = On

The point is to support different character set encodings. There are a wide variety of encodings (ASCII, ANSI, UTF-8, etc) and each one has different character sets and number of bytes per character. The settings your looking at specify default encodings for different PHP functions.
PHP supplies a number of functions that help you deal with these different encodings properly. For an illustration, check out mb_strlen() vs strlen().
Short answer is, unless you're localizing your application's text, or communicating with systems with different encodings (your database included!), you probably don't need to worry about it.

I think everything is explained by the demonstration in this example:
http://fr2.php.net/manual/en/function.mb-internal-encoding.php#53265
Though it's not used in it, you can deduce the use of mbstring.http_input.

How to make internal processing encoding change to UTF8 in PHP?

Currently in my application the utf8 encoded data is spoiled by internal coding of PHP.
How to make it consistent with utf8?
EDIT:To show examples,please tell me how to output the current internal encoding in PHP?
In php.ini I found the following:
default_charset = "iso-8859-1"
Which means Latin1.
How to change it to utf8,say,what's the iso version of utf8?

Change it to:
default_charset = "utf-8"
There is no ISO version of UTF-8.
You'll need to be specific with the details since encoding can be mangled at many different areas in your PHP application.
The common problem areas are:
Saving and retrieving from DB:
The database encoding must the same as the strings sent to it from PHP, or you must convert the strings to the DB encoding.
PHP4's single byte string functions:
PHP's functions such as strlen(), str_replace() do not produce the correct results on multibyte encodings such as UTF-8, since they operate on single bytes.
Page encoding:
Make sure the browser knows you are sending it UTF-8.

You can change the character encoding in php file. To change encoding in php page use the following function.
$new_value = htmlentities('$old_value',ENT_COMPAT, "UTF-8");
and also you can add the following in the html head section
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I hope this will help to solve your problem.

Switch website encoding from ISO-8859-1 to UTF-8

I am trying to convert my existing PHP webpage to use UTF-8 encoding.
To do so, I have done the following things:
specified UTF-8 as the charset in the meta content tag at the start of my webpage.
change the default_charset to UTF-8 in the php.ini.
specified UTF-8 as the iconv encoding in the php.ini file.
specified UTF-8 in my .htaccess file using: AddDefaultCharset UTF-8.
Yet after all that, when i echo mb_internal_encoding(), it shows as ISO-8859-1. What am I missing here? I know I could use auto_prepend to attach a script that changes the default encoding to UTF-8, but I'm just trying to understand what I'm missing.
Thanks

mb_internal_encoding() doesn't effect the output of your scripts per se, it effects the default encoding when using the multibyte string functions and the conversion of POST and GET inputs.
Simply set with
mbstring.internal_encoding='UTF-8'
in your php.ini file, or programmatically with:
mb_internal_encoding('UTF-8');
Speaking of the mb_ functions, you'll need to rewrite your scripts to use these, e.g. mb_strlen() instead of strlen.(), etc.
Also check what HTTP content-type headers are being outputted, though from what you've done it should be ok.
If you using a database, you'll also have to convert that too, and specify that you're using UTF-8 when connecting to it.

The documentation states that you can SET that variable using
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
which should get rid of your problem :)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.