Migrating a php application to handle UTF-8

Migrating a php application to handle UTF-8 - php

I am working on a multi-language app in php.
All was fine until recently I was asked to support Chinese characters. The actions I took to support UTF-8 characters are the following:
All DB tables are now UTF-8
HTML templates contain the tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The controllers send out a header specifying the encoding (utf-8) to use for the http response
All was good until I started making some string manipulations (substr and the likes)
With chinese it won't work because the chinese is represented as multibytes and hence if you do a normal substring (substr) it will prolly cut a "letter" in the middle of one of the bytes allocated and f*ck up the result on screen.
I fixed ALL my problems by adding this in the bootstrap
mb_internal_encoding("UTF-8");
and replacing all the strlen, substr, strstr with their mb_ counterparts.
What other things do I need to do to support UTF-8 fully in php?

There's a little more to it than just replacing those functions.
Regular expressions
You should add the utf8 flag to all of your PCRE regular expressions that can have strings which contain non-Ascii chars, so that the patterns are interpreted as the actual characters rather than bytes.
$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; //The u flag indicates the pattern is UTF8
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
Also you should use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets?
\p{L} instead of \w for any 'letter' character.
\p{Z} instead of \s for any 'space' character.
\p{N} instead of \d for any 'digit' character e.g. Arabic numbers
There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.
Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.
Function replacements
Although your list is a start, the list of function I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacement functions, some of which are not defined in PHP, but are available from here on Github as mb_extra.
$unsafeFunctions = array(
'mail' => 'mb_send_mail',
'split' => null, //'mb_split', deprecated function - just don't use it
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr',
'strlen' => 'mb_strlen',
'strpos' => 'mb_strpos',
'strrpos' => 'mb_strrpos',
'strrchr' => 'mb_strrchr',
'strripos' => 'mb_strripos',
'strstr' => 'mb_strstr',
'strtolower' => 'mb_strtolower',
'strtoupper' => 'mb_strtoupper',
'substr_count' => 'mb_substr_count',
'substr' => 'mb_substr',
'str_ireplace' => null,
'str_split' => 'mb_str_split', //TODO - check this works
'strcasecmp' => 'mb_strcasecmp', //TODO - check this works
'strcspn' => null, //TODO - implement alternative
'strrev' => 'mb_strrev', //TODO - check this works
'strspn' => null, //TODO - implement alternative
'substr_replace'=> 'mb_substr_replace',
'lcfirst' => null,
'ucfirst' => 'mb_ucfirst',
'ucwords' => 'mb_ucwords',
'wordwrap' => null,
);
MySQL
Although you would have thought that setting the character type to utf8 would give you UTF-8 support in MySQL, it does not.
It only gives you support for UTF-8 that are encoded in up to 3 bytes aka the Basic Multi-lingual Plane. However people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, also know as the Supplementary Multilingual Plane
To support these you should in general use:
utf8mb4 - for your character encoding.
utf8mb4_unicode_ci - for your character collation.
For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.
The list of places where you should set the character set and collation in your MySQL config file are:
[mysql]
default-character-set=utf8mb4
[client]
default-character-set=utf8mb4
[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
The SET NAMES may not be required in all circumstances - but it is safer on at only a small speed penalty.
PHP INI File
Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:
mbstring.language = Neutral ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set dectection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header
Helping browser to choose UTF8 for forms
You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.
Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.
Misc
If you're using Apache set "AddDefaultCharset utf-8"
As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.
That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.

Related

Php qr code generator works strange with utf-8 phrase

I downloaded the library http://phpqrcode.sourceforge.net/ and wrote simplest code for it
include('./phpqrcode/qrlib.php');
QRcode::png('иванов иван иванович 11111');
But resulted qr code contains only half of string
Resulted qr code - 'иванов иван ив';
url - vologda-oblast.ru/coronavirus/qr/parampng.php
What can be wrong?

The "phpqrcode" library in your case encodes a number of characters instead of the number of bytes of a UTF-8 string. That’s why the string is truncated. If you QR-encode English-only text, the string will not be truncated. The truncation occurs only with Cyrillic characters since it takes 2 bytes to encode each Cyrillic character in UTF-8 rather than just a single byte for a Latin one.
Interestingly, the demo example of the library on the author’s page do encode Cyrillic characters correctly.
The truncation happens in your case because you are using the following options in your php.ini file:
mbstring.func_overload = 2
mbstring.internal_encoding = "UTF-8"
If you remove the mbstring.func_overload (deprecated since PHP 7.2.0) from php.ini or set it 0, the "phpqrcode" library will start working properly. Otherwise, the strlen() function used by the library will return number of characters rather than the number of bytes in a UTF8-ecoded octet string, while str_split(), another function used by the library, will always return the number of bytes since it is not affected by mbstring.func_overload. As a result, your QR-codes will contain truncated strings.
Since you are using the Bitrix Site Manager CMS, removing the mbstring.func_overload from php.ini may be problematic until you fully update Bitrix to 20.5.393 (released on September 2020) or later version. Earlier version did rely on this deprecated feature. You can find find more information about Bitrix reliance on this deprecated feature at https://idea.1c-bitrix.ru/remove-dependency-on-mbstring-settingsfuncoverload/ or https://idea.1c-bitrix.ru/?tag=4799
Since you cannot change php.ini configuration on run-time, you can try to configure your web server to have php options configure on a per-directory level. Failing that, you can fix the code of the "phpqrcode" library to work correctly, at least partially, in your case, to not rely on the strlen() function. To to that, edit the qrencode.php file the following way. First, replace the $eightbit constant of the QREncode class from false to true. Second, in the function encodeString8bit, replace
$ret = $input->append(QR_MODE_8, strlen($string), str_split($string));
to
$arr = str_split($string);
$len = count($arr);
$ret = $input->append(QR_MODE_8, $len, $arr);
Anyway, since the "phpqrcode" library does not currently support Extended Channel Interpretations (ECI) mode, you cannot reliably encode Cyrillic characters with the library. It uses the 8-bit string mode of storing text in a QR code, which by default may only contain ISO-8859-1 (Latin-1) characters unless the default character set is modified by a ECI entry. But the library cannot insert the ECI entry into a QR code to show that the text has UTF-8 encoding rather than ISO-8859-1. Some decoding applications will auto-detect the wrong charset and show the string correctly, while some (compliant) may not.
As a conclusion, since the "phpqrcode" does not currently support ECI, you cannot reliably encode Cyrillic characters with it, but you can at least make it not truncate the string as I have shown above.

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (like with te character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters - as a result filenames on disk and filenames stored to the database don't match anymore. This conversion, which failes sometimes, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I then again can remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove / recode any special characters left in the string. For example I lose all "äöüÄÖÜ" etc., which are quite regular in german language.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!

I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding but it does it from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). In my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding the function with it.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.

You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic latin alphabet), but when it goes beyond that (like for Umlauts), it's either one or the other. They have different code points in UTF-8 than they have in Windows-1252.

Keep to ASCII in the filesystem - if you need to sustain characters outside ASCII in a filename, there are
schemes you can use to represent unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt

PHP: html_entity_decode removing/not showing character

I am having a problem with Â character on my website.
I have a website where users can use a wysiwyg editor (ckeditor) to fill out their profile. The content is ran through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the Â character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.

You are probably facing the situation that the string actually is not properly UTF-8 encoded (as you wrote it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally to find the character you can't see, create a hexdump of the string.

Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode( preg_replace($match, $replace, utf8_decode($utf8_text));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )

Why does PHP's iconv need setlocale?

I'm currently trying to remove all special characters and accents from an UTF-8 string by turning them into their equivalent ASCII character if possible.
So I'm simply using this code:
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
The problem is that for example the word "début" turns into "dbut" instead of "debut".
To make it work, I need to add a call to setlocale, like this:
setlocale(LC_ALL, 'en_US.UTF8');
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
And I don't understand why. I thought UTF-8 and ASCII were always the same, whatever locale you use.
EDIT: I didn't mean UTF-8 equals ASCII, I meant UTF-8 always equals UTF-8 and ASCII always equals ASCII

The subset of UTF-8 that overlaps with ASCII (which is code points 0-127) is indeed identical with ASCII. However, accented latin characters are not part of the ASCII character set and if you don't setlocale yourself, the system's default locale (which evidently does not contain these accented characters) is used to get a character set to work with.
In general, iconv can be a little iffy; this is mentioned in the introduction of the extension:
This module contains an interface to iconv character set conversion
facility. With this module, you can turn a string represented by a
local character set into the one represented by another character set,
which may be the Unicode character set. Supported character sets
depend on the iconv implementation of your system. Note that the iconv
function on some systems may not work as you expect. In such case,
it'd be a good idea to install the GNU libiconv library. It will
most likely end up with more consistent results.

Preparing PHP application to use with UTF-8

UTF-8 is de facto standard for web applications now, but PHP this is not a default encoding for PHP (until 6.0). Most of the server is set up for the ISO-8859-1 encoding by default.
How to overload the default settings in the .htaccess to be sure that everything goes well for UTF-8, locale etc.? Any options for the web server, Unix OS?
Is there any comprehensive list of those settings? E.g. mbstring options, iconv settings, locale etc I should set up for each multi language project? Any pre defined .htaccess as an example?
(In my particular case I need setup for the languages: English, Dutch and Russian. The server is in Ukraine).

Some useful options to have in .htaccess:
########################################
# Locale settings
########################################
# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"
SetEnv LC_ALL nl_NL.UTF-8
########################################
# Set up UTF-8 encoding
########################################
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
php_value default_charset "UTF-8"
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"
php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
php_value mbstring.encoding_translation On
php_value mbstring.func_overload 6
# See also php functions:
# mysql_set_charset
# mysql_client_encoding
# database settings
#CREATE DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#
#ALTER DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#ALTER TABLE tbl_name
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# ;

You're right UTF-8 is a good choice for webapplications.
Encoding is meta-information to the data that get's processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost, if you don't know the encoding. I often call this a chain, if the encoding-chain is broken, the data will be broken. This is both true for displaying data as well as for security.
As a rule of thumb, PHP is binary, it's the context/you who specifies the encoding (e.g. how you save your php source-code files).
So let's tackle a short (and incomplete) list:
The OS
Environment variables might tell you about the locale in use and the encoding. File-systems do have their encoding for names of files and directories for example. I'm not very firm to this subject, normally we try to name our files in english so to use only characters in the range of US-ASCII which is safe for the Latin extended charsets like ISO-8859-1 in your case as well as for UTF-8.
Just keep this in mind when you save files your users upload: Just filter filenames to basic letters and punctation and you'll have nearly no hassles (a-z, A-Z, 0-9, ., -, _), even make them all lowercase for visual purposes.
If you feel that this degrades usability and the file-system does not offer the unicode range of characters as of UTF-8, you can fallback to simple encodings like rawurlencode (Percent-Encoding, triplet) and offer files to download by resolving that name to disk.
Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
HTML
This is merely independent to PHP, it's about the output your scripts provide so the field of work.
Handling character encodings in HTML and CSS
Rule of thumb is: Specify it. If you didn't specifiy it (HTML files, CSS files, Javascript files) don't expect it to work precisely. Just do it then. Encoding is a chain, if there are many components, ensure that each knows about it's encoding. Otherwise browsers can only guess. UTF-8 is a good choice so, but our job is to take care and make this precise and well defined.
PHP Settings
As a general rule of thumb, start reading the php.ini file that ships with the PHP package of your linux distro. It comes with readable documentation in it's comments and further links. Some settings that come to my mind:
default_charset - PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty (Source). For general information see Setting the HTTP charset parameterW3C. If you want to improve your site's output, e.g. for preserving the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well <meta http-equiv="Content-type" content="text/html;charset=UTF-8">.
output_handler - This setting is worth to look at as it is specifying the output handler (Output Buffering ControlDocs) and each handler (mb, iconv) can have it's own encoding settings (see Strings).
Strings
StringsDocs - By default strings in PHP are binary. As long as you use them with binary safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings. That's for forward compatibility of the said PHP 6 unicode support: $binary = (binary) $string; or $binary = b"binary string";.
mb_internal_encoding()Docs - Gain or set it; mbstring.internal_encodingINI. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
iconv_set_encoding()Docs - Comparable for the iconv extension. See as well the iconv configuration settings.
Various: Some functions that deal with character sequences allow you to specify a charset encoding. For example htmlspecialcharsDocs. Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 but you're looking for UTF-8. Other functions like html_entity_decodeDocs are using UTF-8 per default. Some like htmlspecialchars_decode do not specify a charset at all, so you need to read the PHP source-code for a concrete specific understanding of how the function deals with the (binary) string.
To answer your question: The need of settings and parameters always depend on the components you use. For the general ones like the browser or the webserver, it's possible to give recommendation settings to get it configured for UTF-8. But with everything else it depends. The most important thing is to look for it and to ensure that you know the encoding and can configure/specify it. Often it's documented. As long as you don't need to deal with portable code, this is much simpler as you have control of the environment or you need to deal with a specific environment only. Write code defensively with encoding in mind and you should be fine.

All your files have to be saved in UTF-8 (without BOM) using your code editor.
Webserver may be configured to send inappropriate headers, so it's recommended to override them in application level. For instance:
header('Content-Type: text/html; charset=utf-8');
Add HTML meta content-type:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Use htmlspecialchars() instead of htmlentities() because the former is enough in utf-8 and the latter is incompatible with utf-8 by default.
Tend not to use PHP standard string functions because many of them are incompatible with utf-8. Try to find their counterparts in Multibyte String or other libraries. (Don't forget to set default charset for the library before using it because the library supports many encodings and utf-8 is just one of them.)
For regular expressions use u modifier. For example:
preg_match('/ž{3,5}/u', $string, $matches);
Together this is the most reliable way to check if the given string is valid utf-8 string:
if (#preg_match('//u', $string) === false) {
// NOT valid!
} else {
// Valid!
}
If you use the database then always set appropriate connection encoding right after the connection is made. Example for MySQL:
mysql_set_charset('utf8', $link);
Also check if columns in the database are in utf-8. It's not always needed but recomended.

Basically I do three things to work correctly with czech language:
1) define locale in PHP:
setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");
so you would use something like:
setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");
based on language which is currently switched to.
2) define charset for the database:
mysql_query("set names latin2 collate latin2_czech_cs");
3) define the charset of PHP/HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
I don't use any .htaccess setting. You can modify this for your case, in locale use something like en_US.utf8 (based on language currently which is currently switched to), in charset use utf-8 instead of latin2/iso-8859-2 and it should work well.

Try one of the following:
AddDefaultCharset UTF-8
AddCharset UTF-8 .php

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.