What do these PHP mbstring settings do?


I'm trying to figure out exactly what these php.ini settings do. What happens when they're set to different values? When are they necessary? When are they harmful?
mbstring.language
mbstring.http_input
mbstring.http_output
mbstring.encoding_translation
As usual, the PHP manual is less than helpful.
EDIT: Just to clarify, I understand how character encodings work, and I understand how PHP's multi-byte functions differ from their single-byte counterparts. I'm looking for specifics on what the above settings do.
EDIT 2: OK, it looks like they actually do provide more documentation than just the page on runtime configuration, which just has one-line summaries. The first three of these have similarly-named functions, and there are more details on the pages that describe the function versions. I added links above.
EDIT 3: Adding a bounty. I'm looking for specific details on exactly what these settings do, especially the last three. What do they convert from and to, and when do they do it?

You can set mbstring.language to whatever language you are working with. (Source)
language
; language for internal character representation.
mbstring.language = Neutral ; Set default language to neutral(UTF-8) (default)
mbstring.language = English
mbstring.language = Japanese
mbstring.language = Korean ;For Korean market later
http_input
; http input encoding.
mbstring.http_input = pass
mbstring.http_input = auto
mbstring.http_input = UTF-8
mbstring.http_input = UTF-8, SJIS, EUC-JP
http_output
; http output encoding. mb_output_handler must be
; registered as output buffer to function
mbstring.http_output = pass
mbstring.http_output = UTF-8
encoding translation
; enable automatic encoding translation according to
; mbstring.internal_encoding setting. Input chars are
; converted to internal encoding by setting this to On.
; Note: Do _not_ use automatic encoding translation for
; portable libs/applications.
mbstring.encoding_translation = On
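As a rough illustration of what "converting input chars to the internal encoding" amounts to, here is the same kind of conversion done by hand with mb_convert_encoding(). This is only a sketch, not what PHP runs internally; the Shift_JIS byte string and the variable names are assumptions made up for the example.

```php
<?php
// Hypothetical request value: the two characters "日本" as Shift_JIS bytes.
$sjisInput = "\x93\xfa\x96\x7b";

// Convert from the (assumed) HTTP input encoding to the internal encoding.
$internal = mb_convert_encoding($sjisInput, 'UTF-8', 'SJIS');

echo bin2hex($internal), "\n";            // hex dump of the UTF-8 bytes
echo mb_strlen($internal, 'UTF-8'), "\n"; // number of characters
```

With encoding translation switched on, a conversion of this kind is applied to request data before your script sees it, which is why the php.ini comment warns against relying on it in portable code.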

The point is to support different character set encodings. There are a wide variety of encodings (ASCII, ANSI, UTF-8, etc.) and each one has a different character set and number of bytes per character. The settings you're looking at specify default encodings for different PHP functions.
PHP supplies a number of functions that help you deal with these different encodings properly. For an illustration, check out mb_strlen() vs strlen().
Short answer is, unless you're localizing your application's text, or communicating with systems with different encodings (your database included!), you probably don't need to worry about it.

I think everything is explained by the demonstration in this example:
http://fr2.php.net/manual/en/function.mb-internal-encoding.php#53265
Though it's not used in it, you can deduce the use of mbstring.http_input.

Related

Php qr code generator works strange with utf-8 phrase

I downloaded the library http://phpqrcode.sourceforge.net/ and wrote simplest code for it
include('./phpqrcode/qrlib.php');
QRcode::png('иванов иван иванович 11111');
But resulted qr code contains only half of string
Resulted qr code - 'иванов иван ив';
url - vologda-oblast.ru/coronavirus/qr/parampng.php
What can be wrong?

The "phpqrcode" library in your case encodes as many bytes as the string has characters, instead of the full number of bytes of the UTF-8 string. That's why the string is truncated. If you QR-encode English-only text, the string will not be truncated. The truncation occurs only with Cyrillic characters, since each Cyrillic character takes 2 bytes in UTF-8 rather than the single byte needed for a Latin one.
Interestingly, the demo example of the library on the author's page does encode Cyrillic characters correctly.
The truncation happens in your case because you are using the following options in your php.ini file:
mbstring.func_overload = 2
mbstring.internal_encoding = "UTF-8"
If you remove mbstring.func_overload (deprecated since PHP 7.2.0) from php.ini or set it to 0, the "phpqrcode" library will start working properly. Otherwise, the strlen() function used by the library will return the number of characters rather than the number of bytes in a UTF-8-encoded octet string, while str_split(), another function used by the library, will always split the string into single bytes since it is not affected by mbstring.func_overload. As a result, your QR codes will contain truncated strings.
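The arithmetic behind the truncation can be checked without the library. This sketch assumes the script is saved as UTF-8 and that mbstring.func_overload is off (the setting no longer exists in PHP 8), so mb_strlen() stands in for the overloaded strlen(); the variable names are made up for the example.

```php
<?php
$text = 'иванов иван иванович 11111';

$chars = mb_strlen($text, 'UTF-8'); // what an overloaded strlen() would report
$bytes = strlen($text);             // what the 8-bit QR mode actually needs

// Keeping only $chars bytes of the byte string reproduces the truncation.
$truncated = substr($text, 0, $chars);

printf("%d characters, %d bytes, truncated to: %s\n", $chars, $bytes, $truncated);
```

The character count happens to equal the byte length of 'иванов иван ив', which matches the string seen in the scanned QR code.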
Since you are using the Bitrix Site Manager CMS, removing mbstring.func_overload from php.ini may be problematic until you fully update Bitrix to 20.5.393 (released in September 2020) or a later version. Earlier versions relied on this deprecated feature. You can find more information about Bitrix's reliance on this deprecated feature at https://idea.1c-bitrix.ru/remove-dependency-on-mbstring-settingsfuncoverload/ or https://idea.1c-bitrix.ru/?tag=4799
Since you cannot change this php.ini setting at run time, you can try to configure your web server so that PHP options are set on a per-directory level. Failing that, you can patch the "phpqrcode" library so that, at least in your case, it does not rely on the strlen() function. To do that, edit the qrencode.php file as follows. First, change the value of $eightbit in the QREncode class from false to true. Second, in the function encodeString8bit, replace
$ret = $input->append(QR_MODE_8, strlen($string), str_split($string));
with
$arr = str_split($string);
$len = count($arr);
$ret = $input->append(QR_MODE_8, $len, $arr);
Anyway, since the "phpqrcode" library does not currently support Extended Channel Interpretation (ECI) mode, you cannot reliably encode Cyrillic characters with the library. It uses the 8-bit string mode of storing text in a QR code, which by default may only contain ISO-8859-1 (Latin-1) characters unless the default character set is modified by an ECI entry. But the library cannot insert the ECI entry into a QR code to show that the text is encoded in UTF-8 rather than ISO-8859-1. Some decoding applications will auto-detect the actual charset and show the string correctly, while some (compliant) ones may not.
As a conclusion, since the "phpqrcode" does not currently support ECI, you cannot reliably encode Cyrillic characters with it, but you can at least make it not truncate the string as I have shown above.

What is the character set if default_charset is empty

In PHP 5.6 onwards the default_charset string is set to "UTF-8" as explained e.g. in the php.ini documentation. It says that the string is empty for earlier versions.
As I am creating a Java library to communicate with PHP, I need to know which values I should expect when a string is handled as bytes internally. What happens if the default_charset string is empty and a (literal) string contains characters outside the range of ASCII? Should I expect the default character encoding of the platform, or the character encoding used for the source file?

Short answer
For literal strings -- always source file encoding. default_charset value does nothing here.
Longer answer
PHP strings are "binary safe", meaning they do not have any internal string encoding. Basically, strings in PHP are just buffers of bytes.
For literal strings, e.g. $s = "Ä", this means that the string will contain whatever bytes were saved in the file between the quotes. If the file was saved in UTF-8 this is equivalent to $s = "\xc3\x84"; if the file was saved in ISO-8859-1 (latin1) it is equivalent to $s = "\xc4".
Setting default_charset value does not affect bytes stored in strings in any way.
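A small sketch of that point, using byte escapes so it does not depend on how the script itself is saved. The variable names are illustrative only.

```php
<?php
$utf8   = "\xc3\x84"; // "Ä" as stored by a UTF-8 source file
$latin1 = "\xc4";     // "Ä" as stored by an ISO-8859-1 source file

// Changing default_charset does not alter bytes already held in a string.
ini_set('default_charset', 'ISO-8859-1');
$hexAfterIniSet = bin2hex($utf8);

// Only an explicit conversion changes the bytes.
$converted = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo $hexAfterIniSet, ' ', bin2hex($converted), "\n";
```

For the Java side this means the bytes received for a literal depend on how the PHP file was saved, not on any ini setting.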
What does default_charset do then?
Some functions that have to deal with strings as text and are encoding-aware accept an $encoding argument (usually optional). This tells the function which encoding the text in the string uses.
Before PHP 5.6 the default values of these optional $encoding arguments were either fixed in the function definition (e.g. htmlspecialchars()) or configurable through various php.ini settings for each extension separately (e.g. mbstring.internal_encoding, iconv.input_encoding).
In PHP 5.6 the new php.ini setting default_charset was introduced. The old settings were deprecated, and all functions that accept an optional $encoding argument should now default to the default_charset value when the encoding is not specified explicitly.
However, the developer is still responsible for making sure that the text in the string is actually encoded in the encoding that was specified.
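To illustrate what happens when the declared encoding and the actual bytes disagree, here is a minimal sketch with htmlspecialchars(). The input string is a made-up example, and the flags are passed explicitly so the behaviour does not depend on the PHP version's default flags.

```php
<?php
$latin1Text = "\xc4 < B"; // "Ä < B" encoded as ISO-8859-1 (hypothetical input)

// Declared encoding matches the bytes: only "<" is escaped.
$ok = htmlspecialchars($latin1Text, ENT_QUOTES, 'ISO-8859-1');

// Declared encoding does not match: "\xc4 " is not valid UTF-8, and without
// ENT_SUBSTITUTE or ENT_IGNORE the function returns an empty string.
$mismatch = htmlspecialchars($latin1Text, ENT_QUOTES, 'UTF-8');

var_dump($ok === "\xc4 &lt; B", $mismatch === '');
```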
Links:
Details of the String Type
More details on nature of PHP strings (does not mention default_charset at the time of writing).
New features in PHP 5.6: Default character encoding
Short introduction of new default_charset option in PHP 5.6 release notes.
Deprecated features in PHP 5.6: iconv and mbstring encoding settings
List of php.ini options deprecated in favour of the default_charset option.

It seems you should not rely on the internal encoding. The internal character encoding can be seen/set with mb_internal_encoding.
example phpinfo()
PHP Version 5.5.9-1ubuntu4.5
default_charset no value
file1.php
<?php
$string = "e";
echo mb_internal_encoding(); //ISO-8859-1
file2.php
<?php
$string = "É";
echo mb_internal_encoding(); //ISO-8859-1
Both files will output ISO-8859-1 if you do not change the internal encoding manually.
<?php
echo bin2hex("ö"); //c3b6 (utf-8)
Getting the hex of this character returns UTF-8 encoding. If you save the file using UTF-8 the string in this example will have 2 bytes, even if the internal encoding is not set to UTF-8. Therefore you should rely on the character encoding used for the source file.

Migrating a php application to handle UTF-8

I am working on a multi-language app in php.
All was fine until recently I was asked to support Chinese characters. The actions I took to support UTF-8 characters are the following:
All DB tables are now UTF-8
HTML templates contain the tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The controllers send out a header specifying the encoding (utf-8) to use for the http response
All was good until I started doing some string manipulation (substr and the like).
With Chinese it won't work, because Chinese characters are represented as multiple bytes, so a normal substring (substr) will probably cut a "letter" in the middle of its bytes and mess up the result on screen.
I fixed ALL my problems by adding this in the bootstrap
mb_internal_encoding("UTF-8");
and replacing all the strlen, substr, strstr with their mb_ counterparts.
What other things do I need to do to support UTF-8 fully in php?

There's a little more to it than just replacing those functions.
Regular expressions
You should add the u (UTF-8) modifier to all of your PCRE regular expressions whose patterns or subjects can contain non-ASCII characters, so that they are interpreted as actual characters rather than bytes.
$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; //The u flag indicates the pattern is UTF8
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
Also, you should use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets:
\p{L} instead of \w for any 'letter' character.
\p{Z} instead of \s for any 'space' character.
\p{N} instead of \d for any 'digit' character e.g. Arabic numbers
There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.
Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.
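A quick sketch of the difference the modifier and the Unicode classes make. The sample strings are my own, written with \u{...} escapes (PHP 7+) so the example does not depend on the file's encoding.

```php
<?php
$word   = "Hell\u{F3}";            // "Helló"
$digits = "\u{663}\u{664}\u{665}"; // Arabic-Indic digits ٣٤٥

// Unicode property classes need the u modifier.
$isLetters = preg_match('/^\p{L}+$/u', $word);
$isDigits  = preg_match('/^\p{N}+$/u', $digits);

// Without the u modifier "." matches one byte, so a two-byte character fails.
$byteMode = preg_match('/^.$/', "\u{F3}");
$utf8Mode = preg_match('/^.$/u', "\u{F3}");

var_dump($isLetters, $isDigits, $byteMode, $utf8Mode);
```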
Function replacements
Although your list is a start, the list of functions I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacements, some of which are not defined in PHP but are available from here on GitHub as mb_extra.
$unsafeFunctions = array(
'mail' => 'mb_send_mail',
'split' => null, //'mb_split', deprecated function - just don't use it
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr',
'strlen' => 'mb_strlen',
'strpos' => 'mb_strpos',
'strrpos' => 'mb_strrpos',
'strrchr' => 'mb_strrchr',
'strripos' => 'mb_strripos',
'strstr' => 'mb_strstr',
'strtolower' => 'mb_strtolower',
'strtoupper' => 'mb_strtoupper',
'substr_count' => 'mb_substr_count',
'substr' => 'mb_substr',
'str_ireplace' => null,
'str_split' => 'mb_str_split', //TODO - check this works
'strcasecmp' => 'mb_strcasecmp', //TODO - check this works
'strcspn' => null, //TODO - implement alternative
'strrev' => 'mb_strrev', //TODO - check this works
'strspn' => null, //TODO - implement alternative
'substr_replace'=> 'mb_substr_replace',
'lcfirst' => null,
'ucfirst' => 'mb_ucfirst',
'ucwords' => 'mb_ucwords',
'wordwrap' => null,
);
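To see why the replacements matter, here is a minimal sketch comparing the byte-based and multibyte versions on a short Chinese string. The sample text and variable names are assumptions for the example.

```php
<?php
$text = "\u{4E2D}\u{6587}\u{5B57}"; // "中文字": 3 characters, 9 bytes in UTF-8

$byteLength = strlen($text);
$charLength = mb_strlen($text, 'UTF-8');

$byteCut = substr($text, 0, 4);             // cuts the second character apart
$charCut = mb_substr($text, 0, 2, 'UTF-8'); // first two whole characters

$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutValid = mb_check_encoding($charCut, 'UTF-8');

var_dump($byteLength, $charLength, $byteCutValid, $charCutValid);
```

Passing the encoding explicitly, as done here, keeps the calls independent of mb_internal_encoding().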
MySQL
Although you would have thought that setting the character set to utf8 would give you UTF-8 support in MySQL, it does not.
It only gives you support for UTF-8 characters that are encoded in up to 3 bytes, i.e. the Basic Multilingual Plane. However, people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, which live in the supplementary planes such as the Supplementary Multilingual Plane.
To support these you should in general use:
utf8mb4 - for your character encoding.
utf8mb4_unicode_ci - for your character collation.
For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.
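If you want to check from PHP which of your characters would not fit into the 3-byte utf8 character set, counting bytes is enough. A small sketch, with sample characters I picked for illustration (the arrow function needs PHP 7.4+):

```php
<?php
$samples = [
    'A'       => 'A',
    'e-acute' => "\u{E9}",
    'CJK'     => "\u{4E2D}",
    'emoji'   => "\u{1F600}",
];

// Byte length of each character in UTF-8 (keys are preserved).
$byteLengths = array_map('strlen', $samples);

// Characters needing 4 bytes do not fit in MySQL's 3-byte utf8 charset.
$needsMb4 = array_keys(array_filter($byteLengths, fn ($n) => $n > 3));

print_r($byteLengths);
print_r($needsMb4);
```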
The list of places where you should set the character set and collation in your MySQL config file are:
[mysql]
default-character-set=utf8mb4
[client]
default-character-set=utf8mb4
[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
The SET NAMES may not be required in all circumstances, but it is safer to leave it on, at only a small speed penalty.
PHP INI File
Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:
mbstring.language = Neutral ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set detection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header
Helping browser to choose UTF8 for forms
You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.
Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.
Misc
If you're using Apache set "AddDefaultCharset utf-8"
As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.
That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.

How PHP knows the encoding of the .php files?

How does PHP know the encoding of the .php-files it interprets?
I mean the .php-files could be encoded in e.g. UTF-8 or CP 1252. And this would affect e.g. string literals.
Is there one setting in the php.ini? Or does PHP try to determine the encoding automatically (e.g. assume CP 1252 if no valid UTF-8 ...)?
Thanks for your explanation!

PHP source code makes no assumption about the source encoding. Everything is treated as binary. This means that if your editor saves a file as CP-1252 (I sure hope not), the strings you echo are also CP-1252.

The encoding of a file has very little to do with string literals in it. Strings are just a sequence of bytes as far as PHP is concerned; no further data is stored. If you include UTF-8 strings in an ISO-8859-15 file, they will still be the bytes of a UTF-8 string. As these are just bytes, you are free to mix different encodings in strings in the same file (although they would look weird in any editor).
You are probably not looking to define an encoding of a file, but just how to handle & output strings. You can define what it outputs as a header (which is most likely what you want) with the default_charset ini-setting, and internal mb_ functions listen to mbstring.internal_encoding.
Note that zend.multibyte should be able to actually scan files in a different encoding that is not compatible with the normal scanner (for instance CP936, Big5, CP949, Shift_JIS), which you can configure in ini settings and help along with a declare(encoding='name'), but I very much doubt this is what you are looking for. I have yet to test that functionality, and the documentation of it is next to non-existent.

Preparing PHP application to use with UTF-8

UTF-8 is the de facto standard for web applications now, but it is not the default encoding for PHP (until 6.0). Most servers are set up for the ISO-8859-1 encoding by default.
How to overload the default settings in the .htaccess to be sure that everything goes well for UTF-8, locale etc.? Any options for the web server, Unix OS?
Is there any comprehensive list of those settings? E.g. mbstring options, iconv settings, locale etc I should set up for each multi language project? Any pre defined .htaccess as an example?
(In my particular case I need setup for the languages: English, Dutch and Russian. The server is in Ukraine).

Some useful options to have in .htaccess:
########################################
# Locale settings
########################################
# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"
SetEnv LC_ALL nl_NL.UTF-8
########################################
# Set up UTF-8 encoding
########################################
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
php_value default_charset "UTF-8"
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"
php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
php_value mbstring.encoding_translation On
php_value mbstring.func_overload 6
# See also php functions:
# mysql_set_charset
# mysql_client_encoding
# database settings
#CREATE DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#
#ALTER DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#ALTER TABLE tbl_name
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# ;

You're right, UTF-8 is a good choice for web applications.
Encoding is meta-information about the data that gets processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost if you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.
As a rule of thumb, PHP is binary; it's the context (you) that specifies the encoding (e.g. how you save your PHP source-code files).
So let's tackle a short (and incomplete) list:
The OS
Environment variables might tell you about the locale in use and the encoding. File systems have their own encoding for the names of files and directories, for example. I'm not very well versed in this subject; normally we try to name our files in English so as to use only characters in the US-ASCII range, which is safe for the Latin extended charsets like ISO-8859-1 in your case as well as for UTF-8.
Just keep this in mind when you save files your users upload: filter filenames down to basic letters and punctuation and you'll have nearly no hassles (a-z, A-Z, 0-9, ., -, _); you can even make them all lowercase for visual purposes.
If you feel that this degrades usability and the file system does not offer the Unicode range of characters that UTF-8 does, you can fall back to simple encodings like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name on disk.
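Both approaches can be sketched in a few lines. The helper name sanitize_filename() and the sample filenames are hypothetical, invented for this example.

```php
<?php
// Hypothetical helper: keep only a-z, A-Z, 0-9, ".", "-", "_" and lowercase.
function sanitize_filename(string $name): string
{
    $filtered = preg_replace('/[^A-Za-z0-9._-]/', '', $name);
    return strtolower($filtered);
}

$safe = sanitize_filename("My R\u{E9}sum\u{E9} (final).PDF");

// Alternative: keep every byte, but store the name percent-encoded.
$encoded = rawurlencode("r\u{E9}sum\u{E9}.pdf");

echo $safe, "\n", $encoded, "\n";
```

In real code you would also want to handle the case where filtering leaves an empty name, for example by generating one.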
Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
HTML
This is largely independent of PHP; it concerns the output your scripts produce.
Handling character encodings in HTML and CSS
Rule of thumb is: specify it. If you didn't specify it (HTML files, CSS files, JavaScript files), don't expect it to work precisely. Just do it. Encoding is a chain: if there are many components, ensure that each one knows about its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care and make this precise and well defined.
PHP Settings
As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to mind:
default_charset - PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty (Source). For general information see Setting the HTTP charset parameter (W3C). If you want to improve your site's output, e.g. to preserve the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well: <meta http-equiv="Content-type" content="text/html;charset=UTF-8">.
output_handler - This setting is worth a look, as it specifies the output handler (Output Buffering Control, Docs) and each handler (mb, iconv) can have its own encoding settings (see Strings).
Strings
Strings (Docs) - By default strings in PHP are binary. As long as you use them with binary-safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings. That's for forward compatibility with the planned PHP 6 Unicode support: $binary = (binary) $string; or $binary = b"binary string";.
mb_internal_encoding() (Docs) - Get or set it; see also the mbstring.internal_encoding INI setting. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
iconv_set_encoding() (Docs) - Comparable for the iconv extension. See also the iconv configuration settings.
Various: Some functions that deal with character sequences allow you to specify a charset encoding, for example htmlspecialchars (Docs). Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 but you're looking for UTF-8. Other functions like html_entity_decode (Docs) use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
To answer your question: the settings and parameters you need always depend on the components you use. For the general ones like the browser or the web server, it's possible to recommend settings that configure them for UTF-8. But with everything else it depends. The most important thing is to look for it and to ensure that you know the encoding and can configure/specify it. Often it's documented. As long as you don't need to deal with portable code, this is much simpler, as you either control the environment or only need to deal with one specific environment. Write code defensively with encoding in mind and you should be fine.

All your files have to be saved in UTF-8 (without BOM) using your code editor.
The web server may be configured to send inappropriate headers, so it's recommended to override them at the application level. For instance:
header('Content-Type: text/html; charset=utf-8');
Add HTML meta content-type:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Use htmlspecialchars() instead of htmlentities() because the former is enough in utf-8 and the latter is incompatible with utf-8 by default.
Avoid the standard PHP string functions where you can, because many of them are incompatible with UTF-8. Try to find their counterparts in Multibyte String or other libraries. (Don't forget to set the default charset for the library before using it, because the library supports many encodings and UTF-8 is just one of them.)
For regular expressions use u modifier. For example:
preg_match('/ž{3,5}/u', $string, $matches);
This is also the most reliable way to check whether a given string is valid UTF-8:
if (@preg_match('//u', $string) === false) {
// NOT valid!
} else {
// Valid!
}
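The same check wrapped in a small function, as a sketch. The name is_valid_utf8() is my own, not a built-in.

```php
<?php
// Hypothetical helper around the empty-pattern trick.
function is_valid_utf8(string $string): bool
{
    // preg_match() returns false (not 0) when the subject is not valid UTF-8.
    return preg_match('//u', $string) !== false;
}

var_dump(is_valid_utf8("\xc5\xbe")); // "ž" as two UTF-8 bytes
var_dump(is_valid_utf8("\xc5"));     // lone lead byte, truncated sequence
```

If the mbstring extension is loaded, mb_check_encoding($string, 'UTF-8') is an alternative for the same purpose.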
If you use a database, always set the appropriate connection encoding right after the connection is made. Example for MySQL (this is the old mysql extension; with mysqli the equivalent call is mysqli_set_charset()):
mysql_set_charset('utf8', $link);
Also check that the columns in the database are in UTF-8. It's not always needed, but recommended.

Basically I do three things to work correctly with the Czech language:
1) define locale in PHP:
setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");
so you would use something like:
setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");
based on language which is currently switched to.
2) define charset for the database:
mysql_query("set names latin2 collate latin2_czech_cs");
3) define the charset of PHP/HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
I don't use any .htaccess settings. You can modify this for your case: for the locale use something like en_US.utf8 (based on the language that is currently selected), for the charset use utf-8 instead of latin2/iso-8859-2, and it should work well.

Try one of the following:
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
