I'm updating a PHP app that imports CSV files encoded in UTF-16 (from Google Keyword Planner) and converts the values to UTF-8.
Up to PHP 8.0 this works as expected, but from PHP 8.1 a ? is appended to the value after the conversion from UTF-16 to UTF-8:
var_dump(mb_convert_encoding("\0008\0008\0000\000", "UTF-8", "UTF-16"));
// Output with PHP 8.1.3 - 8.1.13, 8.2.0:
// string(4) "880?"
// Output with PHP 7.4.32, 8.0.8 - 8.0.26:
// string(3) "880"
Your source equals "\x00\x38\x00\x38\x00\x30\x00", which is 7 bytes and therefore an invalid length for UTF-16, which always needs 2 or 4 bytes per character.
You were lucky that PHP 7 and 8.0 silently accepted the first 6 bytes and dropped the 7th,
while PHP 8.1 and later produce a more correct output: the input is read as UTF-16 BE (the default when there is no byte order mark), and the ? tells you that there is an incomplete 4th character, because there's only 1 byte for it.
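The byte count is easy to check before converting. A minimal sketch, assuming big-endian UTF-16 without a BOM as in the example; the helper name hasWholeUtf16Units() is hypothetical and only tests the length, not full validity:

```php
<?php
// Hypothetical helper: UTF-16 text is made of 16-bit units, so a valid byte
// string always has an even length.
function hasWholeUtf16Units(string $bytes): bool
{
    return strlen($bytes) % 2 === 0;
}

$raw = "\x00\x38\x00\x38\x00\x30\x00"; // the 7-byte input from the question

var_dump(strlen($raw));             // int(7)
var_dump(hasWholeUtf16Units($raw)); // bool(false)

// Without the stray trailing byte the conversion is clean.
$fixed = substr($raw, 0, 6);
var_dump(mb_convert_encoding($fixed, 'UTF-8', 'UTF-16BE')); // string(3) "880"
```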
Solution: provide proper input. Perhaps you also misread the octal notation; the bytes are much easier to see without mixing escape sequences and literal characters:
| approach | only 6 bytes (value '880') | make it 8 bytes (value '8800') |
| --- | --- | --- |
| full hexadecimal notation | "\x00\x38\x00\x38\x00\x30" | "\x00\x38\x00\x38\x00\x30\x00\x30" |
| mixed hexadecimal notation | "\x008\x008\x000" | "\x008\x008\x000\x000" |
| full octal notation | "\000\070\000\070\000\060" | "\000\070\000\070\000\060\000\060" |
| mixed octal notation | "\0008\0008\0000" | "\0008\0008\0000\0000" |
| concatenated string to make it more clear | "\x00". '8'. "\x00". '8'. "\x00". '0' | "\x00". '8'. "\x00". '8'. "\x00". '0'. "\x00". '0' |
Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.
How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen(), it's possible to set the mbstring.func_overload directive to 2 in your PHP configuration, so that calls to strlen() in your scripts are automatically replaced with mb_strlen(). Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.
The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two).
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six-byte sequence that is legal UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation: first by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
You need to use the multibyte string function mb_strlen(), like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result as latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then a third program reads the file as latin1, and shows $1ï¿½2.
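The chain above can be reproduced in a few lines. This is a sketch that assumes the mbstring extension is loaded and PHP 7.0+ for the "\u{...}" escape; the variable names are only illustrative:

```php
<?php
// The replacement character U+FFFD is the three bytes EF BF BD in UTF-8.
$utf8 = '$1' . "\u{FFFD}" . '2';

$bytes = strlen($utf8);             // counts bytes
$chars = mb_strlen($utf8, 'UTF-8'); // counts code points

// Misread the same six bytes as ISO-8859-1 and re-encode them as UTF-8:
// EF, BF and BD become the three characters ï, ¿ and ½.
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump($bytes);                       // int(6)
var_dump($chars);                       // int(4)
var_dump($mangled);                     // string(9) "$1ï¿½2"
var_dump(mb_strlen($mangled, 'UTF-8')); // int(6)
```

So strlen() gives 6 for the string with the replacement character and 9 for the doubly mangled one; neither gives 4.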
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF-8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet we have 6 characters, which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g. the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.
I have a string with this content :
$myString = 'Câmara de Dirigentes Lojistas';
This string has 29 characters, but when I call strlen() it returns 30! Even when I call var_dump($myString), this is the result:
114:string 'Câmara de Dirigentes Lojistas' (length=30)
What is going on here? Maybe the problem is related to the special character â?
That's the expected behavior, since your string is encoded in UTF-8.
Please see this note in the strlen() documentation:
Note:
strlen() returns the number of bytes rather than the number of characters in a string.
Because your string contains a multi-byte character (â), which takes two bytes in UTF-8, the byte count is one higher than the character count.
To get the number of characters, use the mb_strlen() function:
mb_strlen("â"); // 1
strlen("â"); // 2
There are several definitions of the "length" of a string, because there are a variety of tricks used to represent the huge range of accented characters, variants, and non-alphabetic scripts used around the world.
The number of bytes the string takes up. This is the easiest to calculate, but not always what is expected. For instance, in UTF-16, every code point takes up either 2 or 4 bytes; in UTF-8, code points take up 1, 2, 3, or 4 bytes. This is what strlen and most PHP functions work with.
The number of "code points": separate symbols in the character set. This is the next easiest, and the next most common, but is generally a compromise between bytes and "graphemes" (see below) - there aren't many cases where it's particularly useful to count é as 2 "characters" just because it's represented with a combining diacritic. In PHP you can use mb_strlen to count these, telling it your string's character encoding.
The number of "graphemes": separate symbols a reader would recognise. This is the most intuitive meaning, but the hardest for a computer to define. In PHP you can use grapheme_strlen, as long as you have ensured your string is encoded as UTF-8.
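The three definitions can be compared on one string. This is a small sketch, assuming the mbstring extension is loaded and PHP 7.0+ for the "\u{...}" escape; grapheme_strlen() comes from the intl extension, so it is guarded here:

```php
<?php
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\u{0301}";

var_dump(strlen($s));             // int(3): bytes (1 for "e", 2 for U+0301)
var_dump(mb_strlen($s, 'UTF-8')); // int(2): code points

// grapheme_strlen() is part of the intl extension, which may not be loaded.
if (function_exists('grapheme_strlen')) {
    var_dump(grapheme_strlen($s)); // int(1): graphemes
}
```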
The character â is not a problem in itself: in UTF-8 it is encoded as two bytes, and strlen() counts bytes. That is why it returns 30 and not 29.
To fix this, use mb_strlen() and pass the encoding:
$myString = 'Câmara de Dirigentes Lojistas';
echo mb_strlen($myString, 'UTF-8');
NOTE: If mb_strlen is undefined, you will have to enable the mbstring extension in your PHP settings.
Interestingly, the â character also exists in ISO-8859-1 (Latin-1, often called "extended ASCII"), where it is represented by just one byte. You can try it with this code:
$str = utf8_decode('Câmara de Dirigentes Lojistas');
echo 'length is ' . strlen($str);
That will output length is 29. (utf8_decode() is deprecated as of PHP 8.2.)
So PHP itself does not assume any encoding: a string is just a sequence of bytes, and the literal takes two bytes for â because the source file is saved as UTF-8.
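The same check can be written with mb_convert_encoding() instead of utf8_decode(). This is a sketch that assumes the mbstring extension is loaded; the â is written as an escape so the result does not depend on how the script file is saved:

```php
<?php
$utf8 = "C\u{00E2}mara de Dirigentes Lojistas"; // "Câmara ..." as UTF-8

// Re-encode as ISO-8859-1, where â is the single byte 0xE2.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump(strlen($utf8));   // int(30)
var_dump(strlen($latin1)); // int(29)
```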
I am looking to create a variable length UTF-8 string in PHP. I tried this method initially:
utf8_encode(mcrypt_create_iv(256, MCRYPT_DEV_URANDOM))
But it gave me a string longer than 256 chars
I then tried this method:
md5(mcrypt_create_iv(32, MCRYPT_DEV_URANDOM))
This works for a 32-character string, but for a 256-character string I would need to concatenate it 8 times, which seems inefficient.
How would I do this efficiently in PHP? Thanks
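The first attempt grows because utf8_encode() treats its input as ISO-8859-1 and turns every byte from 0x80 upwards into two bytes. One possible approach, sketched below, assumes that a string of hex digits is acceptable (ASCII only, so valid UTF-8 with one byte per character) and that PHP 7.0+ is available for random_bytes(); the helper name randomHexString() is hypothetical:

```php
<?php
// Hypothetical helper: returns $length random hex digits.
function randomHexString(int $length): string
{
    if ($length < 1) {
        return '';
    }
    // Each random byte yields two hex digits; round up, then trim.
    $bytes = random_bytes(intdiv($length + 1, 2));
    return substr(bin2hex($bytes), 0, $length);
}

$s = randomHexString(256);
var_dump(strlen($s));                     // int(256)
var_dump(mb_check_encoding($s, 'UTF-8')); // bool(true)
```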
$formatthis = 219;
$printthis = 98;
// %c - the argument is treated as an integer, and presented as the character with that ASCII value.
$string = 'There are %c treated as integer %c';
echo printf($string, $formatthis, $printthis);
I'm attempting to understand printf().
I don't quite understand the parameters.
I can see that the first parameter seems to be the string that the formatting will be applied to.
The second is the first variable to format, and the third seems to be the second variable to format.
What I don't understand is how to get it to print special Unicode characters,
e.g. those beyond a-z, A-Z, !##$%^&*(){}" etc.
Also, why does the output end with what looks like the position of the last quote in the string?
OUTPUT:
There are � treated as integer �32
How could I encode this in UTF-16 (decimal)? For example, the snowman is 9,731 in decimal in UTF-16.
In UTF-8, 'LATIN CAPITAL LETTER A' (U+0041) = 41, but if I write 41 in PHP I get ')'. I googled an ASCII table and it shows that the number for A is 065...
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
If it's already in UTF-8, why are those two numbers different? The outputs are different too.
EDIT: Okay, so the chart I'm looking at is obviously displaying the values in hex, which I didn't immediately notice; 41 in hex is ASCII 065.
%c is basically an int-to-byte function: it outputs the single byte whose value is the given integer. This goes up to the decimal number 255, which will be output as the byte 0xFF.
To output, say, the snowman character ☃, you'd need to output the exact bytes necessary to represent it in your encoding of choice. If you chose UTF-8 to encode it, the necessary bytes are E2 98 83:
printf('%c%c%c', 226, 152, 131); // ☃
// or
printf('%c%c%c', 0xE2, 0x98, 0x83); // ☃
The problem in your case is 1) that the bytes you're outputting don't mean anything in the encoding you're interpreting the result as (the byte 219, 0xDB, is not a valid UTF-8 sequence on its own, which is why you're seeing a "�") and 2) that you're echoing the result of printf, which outputs 32 (printf returns the number of bytes it output).
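Both points, and the UTF-16 question, can be checked with a short script. This is a sketch that assumes PHP 7.2+ with the mbstring extension (for mb_chr()) and a terminal or page that displays UTF-8:

```php
<?php
$snowman = "\u{2603}"; // U+2603 SNOWMAN, code point 9731 in decimal

// The same three UTF-8 bytes, built in different ways.
var_dump(sprintf('%c%c%c', 0xE2, 0x98, 0x83) === $snowman); // bool(true)
var_dump(mb_chr(9731, 'UTF-8') === $snowman);               // bool(true)

// In big-endian UTF-16 the snowman is the two bytes 26 03.
var_dump(bin2hex(mb_convert_encoding($snowman, 'UTF-16BE', 'UTF-8'))); // string(4) "2603"

// printf() prints the text itself and returns its length in bytes,
// so call it without echo.
$length = printf("%s\n", $snowman);
var_dump($length); // int(4): three bytes plus the newline
```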
I am using iconv to convert string from CP1251 to UTF-8
The problem is that the string is 4 bytes long before conversion and 8 bytes after.
After converting, I send the message to Apple's servers, where the length is limited.
How can I convert it and keep the same length?
There is no way to do it. In UTF-8, one-byte codes are used only for the ASCII values 0 through 127; in that range the UTF-8 code has the same value as the ASCII code, and the high-order bit is always 0.
As you are trying to encode non-ASCII characters, you'll get more than 1 byte per character.
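The doubling is easy to reproduce. In this sketch the four-letter sample text is only an assumption for illustration (the question does not show the actual string), and it requires the iconv and mbstring extensions:

```php
<?php
// "Прив" in CP1251: one byte per Cyrillic letter.
$cp1251 = "\xCF\xF0\xE8\xE2";

$utf8 = iconv('CP1251', 'UTF-8', $cp1251);

var_dump(strlen($cp1251));           // int(4)
var_dump(strlen($utf8));             // int(8): two bytes per letter
var_dump(mb_strlen($utf8, 'UTF-8')); // int(4): still four characters
```

The number of characters stays the same; only the number of bytes grows, so a limit expressed in bytes has to be checked against the converted string.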
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
http://en.wikipedia.org/wiki/UTF-8#Overlong_encodings