Russian characters from hex to string utf8 - getting the wrong characters - php

I am trying to pass hex-encoded parameters to an image-creating script. All documents are in utf8. Everything is fine until I go through the string in a loop. See the minimized example:
$string="ABCDЯ";
for($i=0;$i<strlen($string);$i++) {
echo $string[$i]."<br>";
}
gives the output:
A
B
C
D
�
instead of
A
B
C
D
Я
Why is that? Since I want to analyze the characters in the string, it fails at this point, because all Russian characters end up as �.

From the PHP manual:
The string in PHP is implemented as an array of bytes and an integer
indicating the length of the buffer. It has no information about how
those bytes translate to characters, leaving that task to the
programmer.
So you're iterating over $string byte by byte. If a character is encoded with more than one byte, indexing returns only a fragment of it, not the character.
Given that PHP does not dictate a specific encoding for strings, one
might wonder how string literals are encoded. For instance, is the
string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C
form), "\x61\xCC\x81" (UTF-8, D form) or any other possible
representation? The answer is that string will be encoded in whatever
fashion it is encoded in the script file.
You can use mb_substr to get a character when iterating $string.
<?php
$string = 'ABCDЯ';
// Count characters, not bytes: strlen() would run the loop past the last character
for($i = 0; $i < mb_strlen($string, 'UTF-8'); $i++) {
echo mb_substr($string, $i, 1, 'UTF-8') . '<br>';
}
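If the goal is to analyze every character, another option is to split the string into characters once and loop over the result. This is only a minimal sketch, assuming the string is valid UTF-8; the `//u` pattern asks PCRE to treat the subject as UTF-8, and `$chars` is just an illustrative name:

```php
<?php
// "ABCDЯ", with the Cyrillic letter written as an escape so the file encoding does not matter
$string = "ABCD\u{042F}";

// Split between every UTF-8 character instead of between every byte
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $char) {
    // bin2hex() shows the raw bytes behind each character
    echo $char . ' = ' . bin2hex($char) . "\n";
}

echo strlen($string) . " bytes, " . count($chars) . " characters\n";
```

On PHP 7.4 and later, mb_str_split($string, 1, 'UTF-8') should give the same array.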

Related

PHP Unicode to character conversion

I receive country names like this from a library: "\u00c3\u0096sterreich".
How do I convert this to Österreich?
Using PHP 7.3
This one is a lot trickier than it seems, but the code below appears to work.
First we pipe it through the standard regex for Unicode escape sequences, then pack that as a binary string, convert the encoding and finally decode. I cannot promise this is the best way to do this, but it appears to be working correctly as far as I can tell.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
return utf8_decode(mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE'));
}, $str);
The Unicode code point for the character "Ö" is U+00D6.
In UTF-8, this character is encoded as the two bytes c3 and 96 (hex).
The representation \u00c3\u0096 for these 2 bytes is a bit strange. Provided that every multibyte character is represented byte for byte in this way, the following code can also be used.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback(
'~\\\\u00([0-9a-f]{2})~i',
function($m){
return hex2bin($m[1]);
},
$str
);
//Test
$expect = "Österreich";
var_dump($str === $expect); //bool(true)
In case anyone else ends up here with a similar issue, I thought I'd try and shed some light on what's going on. Because as mentioned, this is a lot more complicated than it might look.
A string like \u00c3 refers to a Unicode code-point, in hexadecimal. Ö in the Unicode table is character 214, or \u00d6.
The 214 here is not directly related to how Ö is actually stored in any particular encoding (UTF-8, UTF-16, etc.); it's just an abstract number in the overall Unicode table that refers to that character. UTF-8, for instance, will store it in two bytes, 11000011 10010110 (195 150 in decimal, c3 96 in hexadecimal). There's a really good explanation of how this works in this answer, if you're interested in the finer details.
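To make the step from code point to bytes concrete, here is a rough sketch of the two-byte UTF-8 rule. The helper name encodeTwoByteUtf8 is made up for this illustration, and it only handles code points from U+0080 to U+07FF:

```php
<?php
// Hypothetical helper: encode a code point in the range U+0080..U+07FF as UTF-8
function encodeTwoByteUtf8(int $codePoint): string
{
    // First byte has the form 110xxxxx and carries the top 5 bits
    $first = 0xC0 | ($codePoint >> 6);
    // Second byte has the form 10xxxxxx and carries the low 6 bits
    $second = 0x80 | ($codePoint & 0x3F);
    return chr($first) . chr($second);
}

$bytes = encodeTwoByteUtf8(0x00D6); // U+00D6 is Ö
echo bin2hex($bytes) . "\n";        // c396
echo $bytes . "\n";
```

In real code, mb_chr(0x00D6, 'UTF-8') (PHP 7.2+) does this for any code point.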
What appears to have happened in your string is that these two bytes have then been written out in hexadecimal as if each were a separate Unicode code point. \u00c3 is Ã, and \u0096 is a control character. This is why any standard methods of decoding this (json_decode, etc.) won't have worked: ultimately, what you have is not a valid representation of the string Österreich.
The other answers should both work perfectly well, but this code snippet might better illustrate the issue with the format your library is using. It specifically matches two consecutive low Unicode code points, recombines their values into an unsigned two-byte integer, and then returns that integer as two raw bytes.
$str = '\u00c3\u0096sterreich';
echo preg_replace_callback('/\\\\u00([0-9a-fA-F]{2})\\\\u00([0-9a-fA-F]{2})/', function ($match) {
$i = (hexdec($match[1]) << 8) + hexdec($match[2]);
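// 'n' packs an unsigned 16-bit big-endian value (2 bytes); 'N' would emit 4 bytes with leading NULs
return pack('n', $i);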
}, $str);
Österreich
See https://3v4l.org/QtUuGD
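The double encoding described above can also be made visible by decoding the escapes literally and then undoing the extra step. This is only a sketch: it assumes the json and mbstring extensions are available and that every escape in the input is at most \u00ff, which is true for this string but not for Unicode escapes in general.

```php
<?php
$raw = '\u00c3\u0096sterreich';

// Decoding the escapes literally yields the code points U+00C3 and U+0096
$decoded = json_decode('"' . $raw . '"');
echo bin2hex(substr($decoded, 0, 4)) . "\n"; // c383c296

// Map each code point back to the single byte with the same value
$fixed = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
echo $fixed . "\n";
```

The resulting bytes c3 96 are then read as UTF-8 again, which is what turns them into Ö.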

What encoding is the resulting string if I concatenate a UTF-8 encoded string with an ASCII string in PHP?

If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it? Are there any negative consequences for doing this?
It would depend firstly on whether you mean strict ASCII, which only includes 128 characters. Every single one of these characters has the exact same encoding in the ASCII encoding scheme as it does in the UTF-8 encoding scheme. For these characters, the mb_convert_encoding function will have no effect. You can easily verify this yourself with this script:
/* Convert ASCII to UTF-8 */
for ($i=0; $i<128; $i++) {
$str1 = chr($i);
$str2 = mb_convert_encoding($str1, "UTF-8", "ASCII");
echo $str1 . " - " . $str2 . " - ";
if ($str1 !== $str2) {
echo " - DIFFERENT!";
} else {
echo " - same";
}
echo "\n";
}
For all of these true ASCII characters, there's no point in transcoding them.
HOWEVER, if by "ASCII" you mean extended ASCII and are talking about characters with accents and so on, then you are getting into trouble because there is no definitive character set described by this term. You'll notice that in the list of supported character encodings for PHP's Multibyte String extension there is only one occurrence of the acronym ASCII, and that is for ASCII itself.
To answer your questions more precisely:
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it?
The resulting string is both ASCII and UTF-8 because both encoding schemes use identical byte encodings for those 128 characters.
Are there any negative consequences for doing this?
There should be no negative consequences under any circumstance if the characters are in fact true ASCII characters.
If, on the other hand, the strings include some accented character like Å or õ and some sloppy coder is calling this "extended ASCII" then you might have problems. Those characters have different encodings in the latin-1 and UTF-8 encoding schemes, for instance.
Consider taking a peek at this php function and it may shake loose some understanding. Ask yourself what it means to convert a character which is NOT ASCII from ASCII to UTF-8. It is not a meaningful conversion but it does result in a change in this particular script:
$chars = array("Å", "õ");
foreach ($chars as $char) {
echo $char . " : ";
// Convert the current character, once pretending it is ASCII and once pretending it is Latin-1
$str1 = mb_convert_encoding($char, "UTF-8", "ASCII");
$str2 = mb_convert_encoding($char, "UTF-8", "ISO-8859-1");
echo $str1 . " - " . $str2 . " - ";
if ($char !== $str1) {
echo " - ASCII DIFFERENT";
}
if ($char !== $str2) {
echo " - LATIN 1 DIFFERENT";
}
echo "\n";
}
You might start to get confused at this point. It might help for you to know that my PHP code in that last function has its own character encoding which on my workstation happens to be utf-8. These transformations I've performed are therefore pretty stupid. I'm lying to PHP, saying that these UTF-8 strings are ASCII or Latin-1 and asking PHP to transform them to UTF-8. It performs a transformation as best it can but we all know that transformation isn't meaningful.
I hope you can appreciate what I'm getting at here. Every time you see a character on a computer, it has some encoding. Whether or not there are any negative consequences will depend on how you treat the data that comes to you, what transformations you perform on it, and what you intend to do with it later.
It's helpful to think of a chain of custody. Where did your data come from? What encoding did they use? Is that what I'm using on my system? Where am I sending this data? Does it need to be converted? You should also be careful to specify character sets for all these things:
data you receive from clients
form submissions to your website
display of html on your website
operations on text strings in your applications
character encoding of your connection to a database, character encoding of the tables in your db and encodings of the columns in those tables
character encoding of stored data
email character encoding
character encoding of data submitted to an API
And so on.
General rule of thumb: use utf-8 for everything you possibly can.
ASCII is a subset of UTF-8, so an ASCII string is a valid UTF-8 string. Concatenating two UTF-8 strings is unambiguous.
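A quick way to check this is mb_check_encoding. The sketch below uses made-up sample strings written with escapes, so it does not depend on the encoding of the script file:

```php
<?php
$ascii  = "Hello, ";                 // only bytes below 0x80
$utf8   = "\u{00C5}ngstr\u{00F6}m";  // "Ångström" encoded as UTF-8
$latin1 = "\xC5ngstr\xF6m";          // the same word encoded as ISO-8859-1

var_dump(mb_check_encoding($ascii, 'ASCII'));            // bool(true)
var_dump(mb_check_encoding($ascii . $utf8, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding($ascii . $latin1, 'UTF-8'));  // bool(false)

// Converting from the real source encoding repairs the mixed string
$repaired = $ascii . mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($repaired === $ascii . $utf8);                  // bool(true)
```

Concatenation itself never changes any bytes; the result is valid UTF-8 only if every piece already was.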

What changes my UTF-8 string to ASCII?

I have the following code:
$string = $this->getTextFromHTML($html);
echo mb_detect_encoding($string, 'ASCII,UTF-8,ISO-8859-1');
$stringArray = mb_split('\W+', $string);
$cleaned = array();
foreach($stringArray as $v) {
$string = trim($v);
if(!empty($string))
array_push($cleaned, $string);
}
echo mb_detect_encoding($stringArray[752], 'ASCII,UTF-8,ISO-8859-1');
The above returns:
// UTF-8
// ASCII
What part of my code is turning my string into ASCII? Or am I detecting the encoding incorrectly?
Strings have no actual associated encoding, they're merely byte arrays. mb_detect_encoding doesn't tell you what encoding the string has, it merely tries to detect it. That means it takes a few guesses (your second argument) and tells you the first that is valid.
Your original string probably contains some non-ASCII characters, so ASCII isn't a valid encoding for it, but UTF-8 is. When you're later testing a substring of the original, that substring probably contains only characters which are valid in ASCII, and since ASCII is the first encoding that's tested, that's the guessed result. Any ASCII string is also valid UTF-8, so there's no actual problem or "conversion" which happened.
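The same effect can be reproduced with a small made-up sentence. This sketch passes true as the third argument (strict mode) so the first candidate that fits is reported, and it uses explode instead of mb_split to keep things short:

```php
<?php
$text = "Gr\u{00FC}\u{00DF}e aus Wien"; // "Grüße aus Wien" encoded as UTF-8

// Candidate order matters: ASCII is tried first, UTF-8 second
$candidates = ['ASCII', 'UTF-8'];

echo mb_detect_encoding($text, $candidates, true) . "\n"; // UTF-8

$detected = [];
foreach (explode(' ', $text) as $word) {
    $detected[] = mb_detect_encoding($word, $candidates, true);
    echo $word . ': ' . end($detected) . "\n";
}
```

Only the first word contains non-ASCII bytes, so only that word is reported as UTF-8; the bytes of the other words were never changed.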
As @Phylogenesis mentioned in the comments, ASCII characters (bytes up to 0x7F) are valid UTF-8. Unless your data contains bytes outside that range (a byte order mark, for example), the text is both valid ASCII and valid UTF-8. You've specified ASCII as an option before UTF-8, so it is returned.
For example: https://ideone.com/DupS4A
<?php
$str = "apple";
// Returns ASCII
var_dump(mb_detect_encoding($str, "ASCII, UTF-8"));
// 0xEFBBBF is the byte order mark in UTF-8
$str_with_bom = chr(0xEF) . chr(0xBB) . chr(0xBF) . "apple";
// Returns UTF-8
var_dump(mb_detect_encoding($str_with_bom, "ASCII, UTF-8"));

printf() Extended Unicode Characters?

$formatthis = 219;
$printthis = 98;
// %c - the argument is treated as an integer, and presented as the character with that ASCII value.
$string = 'There are %c treated as integer %c';
echo printf($string, $formatthis, $printthis);
I'm attempting to understand printf().
I don't quite understand the parameters.
I can see that the first parameter seems to be the string that the formatting will be applied to.
The second is the first variable to format, and the third seems to be the second variable to format.
What I don't understand is how to get it to print unicode characters that are special.
E.g. beyond a-z, A-Z, !@#$%^&*(){}", etc.
Also, why does the output end with the position of the last quote in the string?
OUTPUT:
There are � treated as integer �32
How could I encode this in to UTF-16 (Dec) // Snowman = 9,731 DEC UTF 16?
UTF-8 'LATIN CAPITAL LETTER A' (U+0041) = 41, but if I write in PHP 41 I will get ')' I googled an ASCII table and it's showing that the number for A is 065...
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
If it's already UTF-8, why are those two numbers different? The outputs are different, too.
EDIT, Okay so the chart I'm looking at is obviously displaying the digits in HEX value which I didn't immediately notice, 41 in HEX is ASCII 065
%c is basically an int2bin function: it takes an integer and outputs the single byte with that value. This goes up to the decimal number 255, which will be output as the byte 0xFF.
To output, say, the snowman character ☃, you'd need to output the exact bytes necessary to represent it in your encoding of choice. If you chose UTF-8 to encode it, the necessary bytes are E2 98 83:
printf('%c%c%c', 226, 152, 131); // ☃
// or
printf('%c%c%c', 0xE2, 0x98, 0x83); // ☃
The problem in your case is 1) that the bytes you're outputting don't mean anything in the encoding you're interpreting the result as (the byte for 219, 0xDB, is not a complete UTF-8 sequence on its own, which is why you're seeing a "�") and 2) that you're echoing the result of printf, which outputs 32 (printf returns the number of bytes it output).
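If you start from a code point number rather than from bytes, you don't have to work out the bytes by hand. A small sketch, assuming PHP 7.2 or later with mbstring (for mb_chr) and UTF-8 as the output encoding:

```php
<?php
// Code point 9731 (U+2603) is the snowman; mb_chr() produces its UTF-8 bytes
$snowman = mb_chr(9731, 'UTF-8');
echo bin2hex($snowman) . "\n"; // e29883

// %s inserts the bytes unchanged, so multi-byte characters survive
$line = sprintf('There is a %s and a %c', $snowman, 98);
echo $line . "\n";

// printf() prints and then returns the number of bytes written
$written = printf("%s\n", $snowman);
echo $written . "\n"; // 4 (three bytes for the snowman plus the newline)
```

Use sprintf when you want the formatted string back, and printf without echo when you only want it printed.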

Wrong output when using array indexing on UTF-8 string

I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:
$string = "üÜöÖäÄ";
echo $string[0];
I am expecting to see ü, but I get � -- why?
Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.
What happens in your code is that the expression $string[0] gets the first byte of the UTF-8 encoded representation of your string because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).
Since the first character in your string is encoded as more than one byte (UTF-8 encoding rules), you are effectively only getting part of the character. Furthermore, these rules make the byte you are retrieving invalid as a character on its own, which is why you see the replacement character (�).
mb_substr knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.
You can see that $string[0] gives you back just one byte with:
$string = "üÜöÖäÄ";
echo strlen($string[0]);
While mb_substr gives you back two bytes:
$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));
And these two bytes are in fact just one character (you need to use mb_strlen for this):
$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');
Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding to get rid of the 'utf-8' redundancy:
$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
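To see the raw bytes behind both results, bin2hex and mb_check_encoding help. A minimal sketch, with the string written as escapes so it is UTF-8 regardless of how the script file is saved:

```php
<?php
$string = "\u{00FC}\u{00DC}\u{00F6}"; // "üÜö" encoded as UTF-8

$firstByte = $string[0];                         // byte indexing
$firstChar = mb_substr($string, 0, 1, 'UTF-8');  // character indexing

echo bin2hex($firstByte) . "\n"; // c3
echo bin2hex($firstChar) . "\n"; // c3bc

var_dump(mb_check_encoding($firstByte, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($firstChar, 'UTF-8')); // bool(true)
```

The lone byte c3 only announces a two-byte sequence, so on its own it is not valid UTF-8.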
