htmlentities, htmlspecialchars, and "invalid multibyte sequence" - php

This question tells me
htmlentities is identical to htmlspecialchars() in all ways, except with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities.
Sounds like htmlentities is the one I want.
Then this question tells me I need the "UTF-8" argument to get rid of this error:
Invalid multibyte sequence in argument
So, here is my encoding wrapper function (to normalise behaviour across different PHP versions)
function html_entities ($s)
{
return htmlentities ($s, ENT_COMPAT /* ENT_HTML401 */, "UTF-8");
}
I am still getting the "multibyte sequence in argument" error.
Here is a sample string which triggers the error, and it's hex encoding:
Jigue à Baptiste
4a 69 67 75 65 20 e0 20 - 42 61 70 74 69 73 74 65
I notice that the à is encoded as 0xe0 but as a single byte which is above 0x80.
What am I doing wrong?

Your string is encoded in ISO-8859-1, not UTF-8. Plain and simple.
function html_entities ($s)
{
return htmlentities ($s, ENT_COMPAT /* ENT_HTML401 */, "ISO-8859-1");
^^^^^^^^^^^^
}

If à is encoded as 0xE0 then you didn't save the file in UTF-8 encoding. 0xE0 is invalid UTF-8. It should be 0xC3 0xA0
Save your file in UTF-8 encoding. Also see UTF-8 all the way through
If you saved it correctly in utf-8, the hex should look like so:
4A 69 67 75 65 20 C3 A0 20 42 61 70 64 69 73 74 65
J i g u e à B a p t i s t e

Related

UTF-8 issue with PHP's json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running in to is that the API returns a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.
Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences \u..... These sequences do not reference any physical byte encoding in any UTF encoding, it references the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
try the below command to solve their problems.
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
Note: add the JSON_UNESCAPED_UNICODE parameter to the json_encode function to keep the original values.
For python, this Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence

PHP gettext and non-ANSII charters

I have a PHP web application which is originally in Polish. But I was asked to locale it into Russian. I've decided to use gettext. But I've problem when I'm trying to translate string with Polish special characters. For example:
echo gettext('Urządzenie');
Display "Urządzenie" in web browser instead of word in Russian.
All files are encoded in UTF-8 and .po file was generated with --from-code utf-8 . Translations without Polish special chars such as
echo gettext('Instrukcja');
works well. Do you know what could be the reason of this strange behaviour?
Are you sure the PHP file is in UTF-8 format? To verify, try this:
echo bin2hex('Urządzenie');
You should see the following bytes:
55 72 7a c4 85 64 7a 65 6e 69 65

Convert two string to the same byte length

I have 2 strings in my PHP code, 1 is a parameter to my method and 1 is a string from an ini file.
The problem is that they are not equal, although they have the same content, probably due to encoding issues. When using var_dump, it is reported that the first string's lenght is 23 and the second string's length is 47 (see the end of my question for the reason behind this)
How can i make sure they are both encoded the same way and have the same length in the end so comparison won't fail? Preferably, i would like them to be utf8 encoded.
For reference, this is an excerpt from the code:
static function getString($keyword,$file) {
$lang_handle = parse_ini_file($file, true);
var_dump($keyword);
foreach ($lang_handle as $key => $value) {
var_dump($key);
if ($key == $keyword) {
foreach ($value as $subkey => $subvalue) {
var_dump("\t" . $subkey . " => " . $subvalue);
}
}
}
}
with the following ini:
[clientcockpit/login.php]
header = "Kunden Login"
username = "Benutzername"
password = "Passwort"
forgot = "Passwort vergessen"
login = "Login"
When calling the method with getString("clientcockpit/login.php", "inifile.ini") the output is:
string 'clientcockpit/login.php' (length=23)
string '�c�l�i�e�n�t�c�o�c�k�p�i�t�/�l�o�g�i�n�.�p�h�p�' (length=47)
Your INI file seems to be in UTF16 encoding or similar, using two bytes to represent a single character. I guess that the strange characters in your string are actually NULL bytes (\0).
PHP's Unicode support is quite poor and I guess that parse_ini_file() does not support multibyte encodings properly. It will treat the file as if it was encoded using a "ASCII-compatible" single-byte encoding, just looking for special characters [ and ] to detect sections. As a result, the section keys will be corrupted: One byte actually belonging to [ or ] will be part of the section key:
UTF-16: [c] (3 characters, 6 bytes)
For UTF-16BE (big endian):
Bytes: 00 5B 00 63 00 5D (6 bytes)
ASCII: \0 [ \0 c \0 ] (6 characters)
For UTF-16LE (little endian):
Bytes: 5B 00 63 00 5D 00 (6 bytes)
ASCII: [ \0 c \0 ] \0 (6 characters)
Assuming ASCII, instead of reading c, parse_ini_file() will read \0c\0 if the source file encoding is UTF-16.
If you can control the format of your INI file, make sure to save it in UTF8 or ISO-8859-1 encoding, using your favorite text editor.
Otherwise you will have to read in the file contents using file_get_contents(), do the encoding conversion (eg. using iconv()) and pass the result to parse_ini_string(). The drawback here is that you will have to detect or hardcode the original file encoding.
If the mb multibyte extension is available on your PHP installation, you can use mb_detect_encoding() and mb_convert_encoding() to do the conversion dynamically.
Try this:
$lang_handle = parse_ini_string(file_get_contents($file), true);

Why call mb_convert_encoding to sanitize text?

This is in reference to this (excellent) answer. He states that the best solution for escaping input in PHP is to call mb_convert_encoding followed by html_entities.
But why exactly would you call mb_convert_encoding with the same to and from parameters (UTF8)?
Excerpt from the original answer:
Even if you use htmlspecialchars($string) outside of HTML tags, you are still vulnerable to multi-byte charset attack vectors.
The most effective you can be is to use the a combination of mb_convert_encoding and htmlentities as follows.
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
Does this have some sort of benefit I'm missing?
Not all binary data is valid UTF8. Invoking mb_convert_encoding with the same from/to encodings is a simple way to ensure that one is dealing with a correctly encoded string for the given encoding.
A way to exploit the omission of UTF8 validation is described in section 6 (security considerations) in rfc2279:
Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.
This may be more easily understood by examining the binary representation:
110xxxxx 10xxxxxx # header bits used by the encoding
11000000 10101110 # C0 AE
00101110 # 2E the '.' character
In other words: (C0 AE - header-bits) == '.'
As the quoted text points out, C0 AE is not a valid UTF8 octet sequence, so mb_convert_encoding would have removed it from the string (or translated it to '.', or something else :-).

Strange characters in PHP

This is driving me crazy.
I have this one php file on a test server at work which does not work.. I kept deleting stuff from it till it became
<?
print 'Hello';
?>
it outputs
Hello
if I create a new file and copy / paste the same script to it it works!
Why does this one file give me the strange characters all the time?
That's the BOM (Byte Order Mark) you are seeing.
In your editor, there should be a way to force saving without BOM which will remove the problem.
Found it, file -> encoding -> UTF8 with BOM , changed to to UTF :-)
I should ahve asked before wasing time trying to figure it out :-)
Just in case, here is a list of bytes for BOM
Encoding Representation (hexadecimal)
UTF-8 EF BB BF
UTF-16 (BE) FE FF
UTF-16 (LE) FF FE
UTF-32 (BE) 00 00 FE FF
UTF-32 (LE) FF FE 00 00
UTF-7 2B 2F 76, and one of the following bytes: [ 38 | 39 | 2B | 2F ]†
UTF-1 F7 64 4C
UTF-EBCDIC DD 73 66 73
SCSU 0E FE FF
BOCU-1 FB EE 28 optionally followed by FF†

Categories