In PHP 5.6 onwards the default_charset string is set to "UTF-8" as explained e.g. in the php.ini documentation. It says that the string is empty for earlier versions.
As I am creating a Java library to communicate with PHP, I need to know which values I should expect when a string is handled as bytes internally. What happens if the default_charset string is empty and a (literal) string contains characters outside the range of ASCII? Should I expect the default character encoding of the platform, or the character encoding used for the source file?
Short answer
For literal strings -- always source file encoding. default_charset value does nothing here.
Longer answer
PHP strings are "binary safe", meaning they do not have any internal string encoding. Basically, strings in PHP are just buffers of bytes.
For literal strings, e.g. $s = "Ä", this means that the string will contain whatever bytes were saved in the file between the quotes. If the file was saved in UTF-8, this is equivalent to $s = "\xc3\x84"; if the file was saved in ISO-8859-1 (latin1), it is equivalent to $s = "\xc4".
Setting the default_charset value does not affect the bytes stored in strings in any way.
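A minimal sketch of the point above. The byte escapes stand in for what each source-file encoding would store, so the result does not depend on how the example file itself is saved; the variable names are only illustrative.

```php
<?php
// "Ä" written as explicit bytes instead of a literal character.
$utf8   = "\xc3\x84"; // what a UTF-8 source file stores for "Ä"
$latin1 = "\xc4";     // what an ISO-8859-1 source file stores for "Ä"

echo strlen($utf8), ' ', bin2hex($utf8), "\n";     // 2 c384
echo strlen($latin1), ' ', bin2hex($latin1), "\n"; // 1 c4

// Changing default_charset leaves the stored bytes untouched.
ini_set('default_charset', 'ISO-8859-1');
echo bin2hex($utf8), "\n"; // c384
```

For the Java side this suggests treating whatever arrives from PHP as raw bytes and decoding it with an explicitly agreed charset.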
What does default_charset do then?
Some functions that have to deal with strings as text and are encoding-aware accept $encoding as an argument (usually optional). This tells the function which encoding the text in the string is encoded in.
Before PHP 5.6, the default values of these optional $encoding arguments were either fixed in the function definition (e.g. htmlspecialchars()) or configurable in various php.ini settings for each extension separately (e.g. mbstring.internal_encoding, iconv.input_encoding).
In PHP 5.6 the default_charset php.ini setting, which already existed but was empty by default, got the default value "UTF-8" and took on this role. The old per-extension settings were deprecated, and all functions that accept an optional $encoding argument should now default to the default_charset value when the encoding is not specified explicitly.
However, the developer remains responsible for making sure that the text in a string is actually encoded in the encoding that was specified.
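As a small illustration of that responsibility, here is a sketch that passes the same text, stored as two different byte sequences, to htmlentities(). Each call names the encoding the bytes are really in, so both produce the same entity. The variable names are illustrative only.

```php
<?php
$utf8   = "\xc3\x84"; // "Ä" as UTF-8 bytes
$latin1 = "\xc4";     // "Ä" as ISO-8859-1 bytes

// Same text, different bytes: the $encoding argument must match the bytes.
$a = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
$b = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');

echo $a, "\n"; // &Auml;
echo $b, "\n"; // &Auml;
```

If the $encoding argument is omitted, the function falls back to its default (default_charset on PHP 5.6+), which is only correct when the bytes happen to be in that encoding.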
Links:
Details of the String Type
More details on nature of PHP strings (does not mention default_charset at the time of writing).
New features in PHP 5.6: Default character encoding
Short introduction of the new role of the default_charset option in the PHP 5.6 release notes.
Deprecated features in PHP 5.6: iconv and mbstring encoding settings
List of php.ini options deprecated in favour of the default_charset option.
It seems you should not rely on the internal encoding. The internal character encoding can be seen/set with mb_internal_encoding.
example phpinfo()
PHP Version 5.5.9-1ubuntu4.5
default_charset no value
file1.php
<?php
$string = "e";
echo mb_internal_encoding(); //ISO-8859-1
file2.php
<?php
$string = "É";
echo mb_internal_encoding(); //ISO-8859-1
Both files will output ISO-8859-1 if you do not change the internal encoding manually.
<?php
echo bin2hex("ö"); //c3b6 (utf-8)
Getting the hex of this character returns UTF-8 encoding. If you save the file using UTF-8 the string in this example will have 2 bytes, even if the internal encoding is not set to UTF-8. Therefore you should rely on the character encoding used for the source file.
Related
I downloaded the library http://phpqrcode.sourceforge.net/ and wrote simplest code for it
include('./phpqrcode/qrlib.php');
QRcode::png('иванов иван иванович 11111');
But the resulting QR code contains only half of the string
Resulting QR code - 'иванов иван ив';
url - vologda-oblast.ru/coronavirus/qr/parampng.php
What can be wrong?
In your case the "phpqrcode" library uses the number of characters, rather than the number of bytes, of the UTF-8 string as the length of the data to encode. That’s why the string is truncated. If you QR-encode English-only text, the string will not be truncated. The truncation occurs only with Cyrillic characters, since it takes 2 bytes to encode each Cyrillic character in UTF-8 rather than just a single byte for a Latin one.
Interestingly, the demo example of the library on the author’s page does encode Cyrillic characters correctly.
The truncation happens in your case because you are using the following options in your php.ini file:
mbstring.func_overload = 2
mbstring.internal_encoding = "UTF-8"
If you remove mbstring.func_overload (deprecated since PHP 7.2.0) from php.ini or set it to 0, the "phpqrcode" library will start working properly. Otherwise, the strlen() function used by the library will return the number of characters rather than the number of bytes in a UTF-8-encoded octet string, while str_split(), another function used by the library, will always split the string into single bytes, since it is not affected by mbstring.func_overload. As a result, your QR codes will contain truncated strings.
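The mismatch can be reproduced without touching php.ini. The sketch below assumes the mbstring extension is loaded, func_overload is off, and the example file is saved as UTF-8; mb_strlen() stands in for what the overloaded strlen() would return.

```php
<?php
$text = 'иванов иван иванович 11111'; // assumes this file is saved as UTF-8

$chars = mb_strlen($text, 'UTF-8'); // what the overloaded strlen() reports
$bytes = strlen($text);             // what str_split() actually produces

echo $chars, "\n"; // 26
echo $bytes, "\n"; // 44

// Keeping only $chars bytes of a $bytes-long string cuts the text short.
$truncated = substr($text, 0, $chars);
echo $truncated, "\n"; // иванов иван ив
```

The truncated result is exactly the string that was read back from the QR code in the question.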
Since you are using the Bitrix Site Manager CMS, removing mbstring.func_overload from php.ini may be problematic until you fully update Bitrix to 20.5.393 (released in September 2020) or a later version. Earlier versions relied on this deprecated feature. You can find more information about Bitrix's reliance on this deprecated feature at https://idea.1c-bitrix.ru/remove-dependency-on-mbstring-settingsfuncoverload/ or https://idea.1c-bitrix.ru/?tag=4799
Since you cannot change this php.ini setting at run time, you can try to configure your web server to set PHP options on a per-directory level. Failing that, you can fix the code of the "phpqrcode" library so that it works correctly, at least partially, in your case by not relying on the strlen() function. To do that, edit the qrencode.php file in the following way. First, change the value of $eightbit in the QREncode class from false to true. Second, in the function encodeString8bit, replace
$ret = $input->append(QR_MODE_8, strlen($string), str_split($string));
with
$arr = str_split($string);
$len = count($arr);
$ret = $input->append(QR_MODE_8, $len, $arr);
Anyway, since the "phpqrcode" library does not currently support Extended Channel Interpretations (ECI) mode, you cannot reliably encode Cyrillic characters with the library. It uses the 8-bit string mode of storing text in a QR code, which by default may only contain ISO-8859-1 (Latin-1) characters unless the default character set is modified by an ECI entry. But the library cannot insert the ECI entry into a QR code to show that the text has UTF-8 encoding rather than ISO-8859-1. Some decoding applications will auto-detect that the text is actually UTF-8 and show the string correctly, while some (strictly compliant) ones may not.
As a conclusion, since the "phpqrcode" does not currently support ECI, you cannot reliably encode Cyrillic characters with it, but you can at least make it not truncate the string as I have shown above.
I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML, I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
&Sigma;&chi;&epsilon;&tau;&iota;&kappa;ά_&mu;&epsilon;_&mu;&alpha;&sigmaf;
but the 'ά' stays as 'ά'. I need it converted to
&#940;
I'd really appreciate help. I need a universal solution, not a "string replace".
I have been playing around a little, and the following worked. Use mb_convert_encoding instead of htmlentities:
mb_convert_encoding(urldecode($navstring),'HTML-ENTITIES','UTF-8');
//string(90) "domain.tld/&Sigma;&chi;&epsilon;&tau;&iota;&kappa;&#940;_&mu;&epsilon;_&mu;&alpha;&sigmaf;"
See mb_convert_encoding
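The 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so here is a possible alternative, not part of the answer above: mb_encode_numericentity(). It is a sketch that assumes numeric entities are acceptable for every non-ASCII character, so it emits &#931; rather than &Sigma;. The $convmap values are one common choice that covers everything above ASCII.

```php
<?php
$navstring = '%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82';
$decoded   = urldecode($navstring); // UTF-8 bytes of the Greek text

// Map: first code point, last code point, offset, mask.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$html    = mb_encode_numericentity($decoded, $convmap, 'UTF-8');

echo $html, "\n";
// &#931;&#967;&#949;&#964;&#953;&#954;&#940;_&#956;&#949;_&#956;&#945;&#962;
```

ASCII characters such as the underscore pass through unchanged, and 'ά' comes out as &#940; as requested.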
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be :
Always know the character encoding of the data you are using.
Store your data with UTF-8.
Output data with UTF-8
The mbstring php extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() and mb_convert_encoding() functions to convert strings from one character encoding to another.
PHP needs to know!
You also need to tell PHP that you are working with UTF-8. To set the default value, you can do it in your php.ini file:
default_charset = "UTF-8";
That default value is added to the default Content-Type header returned by PHP unless you specified it with the header() function :
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP such as :
htmlentities()
htmlspecialchars()
all the mbstring functions
...
I am struggling at understanding character encoding in PHP.
Consider the following script (you can run it here):
$string = "\xe2\x82\xac";
var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));
mb_internal_encoding("UTF-8");
var_dump($string);
var_dump($utf8string);
I have a string, actually the € character, represented with its Unicode code points. Up to PHP 5.5 the internal encoding used is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I use to define the string.
Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).
If I change the PHP internal encoding also to UTF-8, I'd expect utf8string to be displayed correctly on the screen, but this doesn't happen.
What am I missing?
The script you show doesn't use any non-ascii characters, so its internal encoding does not make any difference. mb_internal_encoding does convert your data on output. This question will tell you more about how it works; it will also tell you it's better not to use it.
The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "unicode code point" (which is U+20AC; like all common Unicode characters, it fits in 2 bytes).
Does this clear up the behavior you see?
You started with a string that is the utf-8 representation of the Euro symbol. If you run echo($string) all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
Then you make the wrong move. You call mb_convert_encoding() with only two arguments. This lets PHP use the current value of the internal encoding of the mbstring extension for the third argument ($from_encoding). Why does this matter?
For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is utf-8 and the call to mb_convert_encoding() is a no-op.
But for previous versions of PHP, the default value returned by mb_internal_encoding() is iso-8859-1 and it doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes them using the rules of utf-8. The outcome is obviously wrong.
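To make both cases visible on any PHP version, here is a small sketch that passes the source encoding explicitly instead of relying on the internal encoding. The variable names are illustrative.

```php
<?php
$string = "\xe2\x82\xac"; // UTF-8 bytes of the Euro sign

// Correct source encoding: UTF-8 to UTF-8 leaves the bytes unchanged.
$same = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

// Wrong source encoding: each byte is read as one ISO-8859-1 character
// and re-encoded as a two-byte UTF-8 sequence.
$broken = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

echo bin2hex($same), "\n";   // e282ac
echo bin2hex($broken), "\n"; // c3a2c282c2ac
```

The second result is what the two-argument call produces on versions where the internal encoding defaults to ISO-8859-1.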
Btw, if you initialize $string with '€' you get the same output on all PHP versions (even on PHP 4, iirc).
According to the PHP website it does this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
Can someone please explain this in simpler terms?
1. HTTP input character encoding conversion
2. HTTP output character encoding conversion
3. default character encoding for string functions
4. What is meant by “internal encoding is totally different from the one for multibyte regex”?
My guess is that
1. means GET and POST are treated as that encoding.
2. means it outputs to that encoding.
3. means it uses that encoding for all multibyte string functions.
4. I have no idea about. Why would regex be different to normal string functions?
If point 2 is correct would you need to do:
ini_set('default_charset', 'UTF-8');
If I understand 3 correctly does that mean if you do:
mb_internal_encoding('UTF-8')
You don't need to do:
mb_strtolower($str, 'UTF-8');
Just:
mb_strtolower($str);
I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?
The mbstring extension added the glorious idea (</sarcasm>) to automatically convert all incoming data and all output data from some encoding to another. See mbstring HTTP Input and Output. It's configured with the mbstring.http_input ini setting and by using the mb_output_handler. mb_internal_encoding influences this conversion. IMO you should leave those settings off and never touch them; I have yet to find any problem that can elegantly be solved by this and it sounds like a terrible idea overall to have implicit encoding conversions going on. Especially if it's all controlled via one global flag (mb_internal_encoding) which is used in a variety of different contexts.
So that's 1. and 2.
For 3., yes indeed, mb_internal_encoding basically sets the default value for all mb_ functions which accept an $encoding parameter. Essentially it just sets a global variable (internally) which other functions read from, that's all.
The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
I did read on another SO post that mb_strtolower($str) should not be trusted and that you need to set the encoding for each multibyte string function. Is this true?
I'd agree to this insofar as all global state cannot be trusted. This is pretty trustworthy:
mb_internal_encoding('UTF-8');
mb_strtolower($string);
However, this is not really:
mb_strtolower($string);
See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. You just need to make a call to some third party library which sets mb_internal_encoding to something else without you knowing, and your mb_strtolower call will suddenly behave very differently.
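A rough sketch of that failure mode. The mb_internal_encoding() call below only simulates a third-party library changing the global setting, and the variable names are made up for the illustration.

```php
<?php
$string = "\xc3\x80\xc3\x89"; // "ÀÉ" as UTF-8 bytes

// Pretend some third-party code changed the global setting.
mb_internal_encoding('ISO-8859-1');

$implicit = mb_strtolower($string);          // reads the bytes as ISO-8859-1
$explicit = mb_strtolower($string, 'UTF-8'); // independent of global state

echo bin2hex($explicit), "\n";     // c3a0c3a9, i.e. "àé"
var_dump($implicit === $explicit); // bool(false)
```

Passing the encoding explicitly costs a few characters per call and removes the dependency on whatever was set elsewhere.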
I am trying to convert my existing PHP webpage to use UTF-8 encoding.
To do so, I have done the following things:
specified UTF-8 as the charset in the meta content tag at the start of my webpage.
change the default_charset to UTF-8 in the php.ini.
specified UTF-8 as the iconv encoding in the php.ini file.
specified UTF-8 in my .htaccess file using: AddDefaultCharset UTF-8.
Yet after all that, when I echo mb_internal_encoding(), it shows as ISO-8859-1. What am I missing here? I know I could use auto_prepend to attach a script that changes the default encoding to UTF-8, but I'm just trying to understand what I'm missing.
Thanks
mb_internal_encoding() doesn't affect the output of your scripts per se; it affects the default encoding when using the multibyte string functions and the conversion of POST and GET inputs.
Simply set it with
mbstring.internal_encoding='UTF-8'
in your php.ini file, or programmatically with:
mb_internal_encoding('UTF-8');
Speaking of the mb_ functions, you'll need to rewrite your scripts to use these, e.g. mb_strlen() instead of strlen(), etc.
Also check what HTTP Content-Type headers are being output, though from what you've done it should be OK.
If you are using a database, you'll also have to convert that, and specify that you're using UTF-8 when connecting to it.
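To illustrate why the byte-oriented functions need replacing, here is a short sketch, assuming the mbstring extension is loaded. The sample word is written as byte escapes so it does not depend on the file encoding.

```php
<?php
mb_internal_encoding('UTF-8');

$word = "\xc3\xa9t\xc3\xa9"; // "été" as UTF-8 bytes

echo strlen($word), "\n";    // 5 (bytes)
echo mb_strlen($word), "\n"; // 3 (characters)

echo bin2hex(substr($word, 0, 1)), "\n";    // c3   (half a character)
echo bin2hex(mb_substr($word, 0, 1)), "\n"; // c3a9 (the whole "é")
```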
The documentation states that you can SET that variable using
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
which should get rid of your problem :)