I have encountered something bizarre in the mb_strwidth function; it may be a bug but I thought it better to ask here first in case I'm missing something.
Context
A class represents a generic string and is both iterable and seekable, with iteration and seeking both operating on the characters within the string. The string has full multi-byte support, so when a new position is sought, the class stores the character position and also recalculates the byte position in the string, like so:
$this->posByte = mb_strwidth(
    mb_substr($this->value, 0, $pos, $this->charEncoding),
    $this->charEncoding
);
Perceived Error
However, when a multi-byte character is introduced, this is returning an incorrect value. The test case is this:
$str = string('The simple sentence of the simple man; here are some multi-byte chars: Øðćă.', 'UTF-8');
$str->seek(72);
This seeks to the second multi-byte character 'ð', but the byte calculation given above returns 72, the same as the character position; whereas it should be 73 since the preceding character 'Ø' has a code point of U+00D8; which is 216 in decimal and firmly in the two-byte character range.
This is confirmed by using the multi-byte unaware function strlen() (since I have not enabled mb overloading); which simply counts the number of bytes in a string. This:
$bytePos = strlen(mb_substr($this->value, 0, $pos, $this->charEncoding));
returns 73 as expected.
Is this a known problem?
I can use strlen() for now as a workaround, but I don't particularly like doing so since enabling multi-byte overloading in the PHP config would then cause the errors to reappear; does anyone have any experience of a similar issue? Is PHP just using an out-of-date character mapping?
For the record, this is from a PHPUnit test run on a PHP 5.6.3 windows environment.
You appear to be misinterpreting the function of mb_strwidth. Its purpose has nothing to do with bytes; it merely gives you the visual width of a string according to a fixed table. This is mainly of interest for Asian character sets with appropriate monospaced fonts, where Latin characters, commas and other punctuation are half-width and "regular" characters are full-width. Every character up to and including U+1FFF has a width of 1.
You need to use strlen and other encoding-unaware functions to operate on strings in bytes, and mb_ functions to operate on them on a character level, to figure out your byte/character relationships.
If you're worried about the barbaric mb-overloading, either check the ini setting and refuse operation on insane systems, or use mb_strlen with a single-byte encoding set.
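A minimal sketch of the distinction, using the string from the question. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative and are not the asker's class.

```php
<?php
// Assumes this file is saved as UTF-8 and ext-mbstring is available.
$value = 'The simple sentence of the simple man; here are some multi-byte chars: Øðćă.';
$pos   = 72; // character offset of 'ð'

$prefix = mb_substr($value, 0, $pos, 'UTF-8');

// Display width in columns: every character in the prefix is narrow.
$width = mb_strwidth($prefix, 'UTF-8');

// Byte length: 'Ø' occupies two bytes in UTF-8.
$bytes = strlen($prefix);

// The same byte count via mbstring, treating every byte as one character.
// This is unaffected by function overloading of strlen().
$bytes8bit = mb_strlen($prefix, '8bit');

// Full-width characters count as two columns each.
$cjkWidth = mb_strwidth('日本語', 'UTF-8');

echo "width=$width bytes=$bytes bytes8bit=$bytes8bit cjkWidth=$cjkWidth\n";
```

For the asker's input the width is 72 and both byte counts are 73, matching the numbers reported in the question.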
Related
In PHP, how can I convert UTF-8 to MUTF-8? I am hoping I can lazily just get away with
function utf8_to_mutf8(string $utf8): string
{
    return str_replace("\x00", "\xC0\x80", $utf8);
}
? Given that every byte of a multi-byte character in UTF-8 has the high bit set, \x00 will never occur inside a multi-byte character, so the following should be completely unnecessary?
function utf8_to_mutf8(string $utf8): string
{
    $old = mb_internal_encoding();
    mb_internal_encoding("UTF-8");
    $ret = mb_ereg_replace("\x00", "\xC0\x80", $utf8);
    mb_internal_encoding($old);
    return $ret;
}
Yes, "\x00" will only occur for codepoint U+0000 and never for any other codepoint. Only ASCII characters (U+0000 to U+007F, bits 00000000 to 01111111) have the highest bit unset. Bytes with the highest bit unset can also be used for synchronization when it is unclear where the next codepoint/character begins.
Yes, str_replace() is enough, because it is already binary safe, as the docs say. That is, it cares neither about the input's encoding nor about global settings.
If your goal is to have a chain of bytes that will never have a "\x00" in it, then this is the way to achieve it.
Personally I think null terminations are outdated, and following the old Java way to work around that limitation just comes with the same disadvantage of not being able to use "\x00" in the first place. You just end up having to undo the modification again so that all UTF-8 handling can deal with the data properly.
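As a rough sketch of the round trip described above: the first function is the one from the question, while mutf8_to_utf8() is a hypothetical inverse added here for illustration. It covers only the NUL handling discussed in this thread; Java's Modified UTF-8 also encodes characters above U+FFFF as surrogate pairs, which this does not attempt.

```php
<?php
// From the question: byte-wise replacement of NUL with the two-byte form "\xC0\x80".
function utf8_to_mutf8(string $utf8): string
{
    return str_replace("\x00", "\xC0\x80", $utf8);
}

// Hypothetical inverse (not from the question). Valid UTF-8 never contains
// the byte 0xC0, so "\xC0\x80" can only have come from an encoded NUL.
function mutf8_to_utf8(string $mutf8): string
{
    return str_replace("\xC0\x80", "\x00", $mutf8);
}

// 'a', NUL, 'Ø' (written as its UTF-8 bytes "\xC3\x98"), NUL
$input   = "a\x00\xC3\x98\x00";
$encoded = utf8_to_mutf8($input);
$decoded = mutf8_to_utf8($encoded);

var_dump(strpos($encoded, "\x00")); // no NUL byte left in the encoded form
var_dump($decoded === $input);      // the round trip restores the input
```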
Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.
How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen, it's possible to set the mbstring.func_overload directive to 2 in your PHP configuration, so that calls to strlen in your scripts are automatically replaced with mb_strlen. Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.
The string you posted is six character long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two)
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six byte long sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit 2). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation; by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
You need to use the multibyte string function mb_strlen(), like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes that reads the file in latin1, and shows $1ï¿½2.
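The three steps can be reproduced roughly as follows. This is only a sketch: the choice of ¢ follows the example above, and the U+FFFD substitution in step 2 is written out by hand rather than produced by a real decoder. It assumes the mbstring extension is loaded.

```php
<?php
// Step 1: a latin1 file containing "$1¢2" ('¢' is the single byte 0xA2).
$latin1 = "\$1\xA22";

// Step 2: a UTF-8 reader cannot decode the lone 0xA2 byte...
$isValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

// ...so it substitutes U+FFFD and writes the text back out as UTF-8
// (simulated here by hand; U+FFFD is "\xEF\xBF\xBD" in UTF-8).
$rewritten = "\$1\xEF\xBF\xBD2";

// Step 3: a third program reads those bytes as latin1 and displays them.
$displayed = mb_convert_encoding($rewritten, 'UTF-8', 'ISO-8859-1');

echo strlen($latin1), "\n";                // bytes in the original file
echo strlen($rewritten), "\n";             // bytes after the substitution
echo mb_strlen($rewritten, 'UTF-8'), "\n"; // characters after the substitution
echo strlen($displayed), "\n";             // bytes of the doubly mangled text as UTF-8
echo $displayed, "\n";
```

The four counts come out as 4, 6, 4 and 9: the original latin1 string is 4 bytes long, which is presumably where the mock test's answer comes from, while 6 and 9 match the figures mentioned in the question and the answers.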
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g. the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.
I have a string with this content :
$myString = 'Câmara de Dirigentes Lojistas';
This string has 29 characters, but when I call strlen it returns 30! Even when I call var_dump($myString), this is the result:
114:string 'Câmara de Dirigentes Lojistas' (length=30)
What is going on here ? Maybe the problem is related to the special char â ?
That's the right behavior since you are using UTF-8 encoding.
Please see this note on strlen() documentation
Note:
strlen() returns the number of bytes rather than the number of characters in a string.
Because your string contains a multi-byte character (â), which UTF-8 encodes as two bytes, the byte count is one higher than the character count.
To get the length in characters, use the mb_strlen() function:
mb_strlen("â"); // 1
strlen("â"); // 2
There are several definitions of the "length" of a string, because there are a variety of tricks used to represent the huge range of accented characters, variants, and non-alphabetic scripts used around the world.
The number of bytes the string takes up. This is the easiest to calculate, but not always what is expected. For instance, in UTF-16, every code point takes up either 2 or 4 bytes; in UTF-8, code points take up 1, 2, 3, or 4 bytes. This is what strlen and most PHP functions work with.
The number of "code points": separate symbols in the character set. This is the next easiest, and the next most common, but is generally a compromise between bytes and "graphemes" (see below) - there aren't many cases where it's particularly useful to count é as 2 "characters" just because it's represented with a combining diacritic. In PHP you can use mb_strlen to count these, telling it your string's character encoding.
The number of "graphemes": separate symbols a reader would recognise. This is the most intuitive meaning, but the hardest for a computer to define. In PHP you can use grapheme_strlen, as long as you have ensured your string is encoded as UTF-8.
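A small sketch of the three counts for one visible character, assuming PHP 7 or later (for the "\u{...}" escape) and the mbstring extension. grapheme_strlen() comes from the intl extension, which may not be installed, so the sketch checks for it first.

```php
<?php
// "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
$decomposed = "e\u{0301}";

$byteCount      = strlen($decomposed);             // bytes
$codePointCount = mb_strlen($decomposed, 'UTF-8'); // code points

// Only available when ext-intl is loaded; null otherwise.
$graphemeCount = function_exists('grapheme_strlen')
    ? grapheme_strlen($decomposed)
    : null;

var_dump($byteCount, $codePointCount, $graphemeCount);
```

Here the same single accented letter is 3 bytes and 2 code points, and one grapheme where intl is available.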
The character â is not an ASCII character, and UTF-8 encodes it as two bytes. strlen counts bytes, which is why it gives 30 and not 29.
To fix this, use mb_strlen() with the encoding:
$myString = 'Câmara de Dirigentes Lojistas';
echo mb_strlen($myString, 'UTF-8');
NOTE: If mb_strlen is undefined, you will have to enable the mbstring extension in your PHP settings.
Interestingly, the â character exists in ISO-8859-1 (often loosely called "extended ASCII"), where it is represented by just one byte. You can try it with this code:
$str = utf8_decode('Câmara de Dirigentes Lojistas');
echo 'length is ' . strlen($str);
That will output: length is 29.
So, as you can see, when a character is outside plain ASCII (the 128-character table), a UTF-8 encoded string stores it in more than one byte.
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?
None of PHP's standard string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese or Korean character takes more than a single byte. For example, a typical English character such as x takes 1 byte, whereas a Japanese, Chinese or Korean character takes more than 1 byte. PHP's standard string functions treat a single character as 1 byte, so if you try to compare two Japanese, Chinese or Korean characters, they will not work as expected. For example, "Hello World!" written in Japanese, Chinese or Korean will take up more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
Summary:
No sane 'default' is possible, as PHP strings do not contain the character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as is useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
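The point about preg_replace() and "three identical characters in a row" can be sketched with PCRE's /u modifier, which tells the engine to treat both pattern and subject as UTF-8. The test string is an illustrative choice; 'ä' is written as its two UTF-8 bytes so the example does not depend on how the file is saved.

```php
<?php
// Three times 'ä' (umlaut-a), i.e. the bytes C3 A4 C3 A4 C3 A4.
$text = str_repeat("\xC3\xA4", 3);

// Without /u each byte is a "character": no byte repeats three times in a row.
$byteMode = preg_match('/(.)\1\1/', $text);

// With /u, '.' matches a whole code point, so the three identical characters are found.
$utf8Mode = preg_match('/(.)\1\1/u', $text);

var_dump($byteMode, $utf8Mode);
```

The same bytes give no match in byte mode and a match in UTF-8 mode, which is the information PHP cannot guess from the string alone.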
multibyte => multi + byte.
1) It is used to work with strings written in languages other than English.
2) The default PHP string functions only work properly with English (or closely related) text.
3) If you want to use strlen(), strpos(), strtoupper() or str_replace() on special characters, you need multibyte-aware handling instead.
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते) or Gujarati (હેલો).
Different languages can have their own character sets, so mbstring was introduced to handle text in various languages (Chinese, Japanese, etc.).
Raul González is a perfect example of why:
The task is to shorten user names that are too long for a MySQL database column: say we have a 10-character limit and the name Raul González.
The unit test below is an example how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
PHP, Laravel and PHPUnit were used for illustration.
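The same effect can be seen without a database or framework. The following is a standalone sketch, assuming UTF-8 and the mbstring extension; mb_strcut() is not part of the test above and is shown only as an option for limits expressed in bytes.

```php
<?php
// 'á' is written as its two UTF-8 bytes "\xC3\xA1" so the file encoding does not matter.
$name = "Raul Gonz\xC3\xA1lez";

$byteCut = substr($name, 0, 10);             // 10 bytes: cuts through the middle of 'á'
$charCut = mb_substr($name, 0, 10, 'UTF-8'); // 10 whole characters
$safeCut = mb_strcut($name, 0, 10, 'UTF-8'); // at most 10 bytes, ending on a character boundary

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // false: ends in a lone "\xC3"
var_dump(mb_check_encoding($charCut, 'UTF-8')); // true
var_dump(strlen($charCut), strlen($safeCut));   // byte lengths of the two valid cuts
```

The lone "\xC3" at the end of the substr() result is the same byte the MySQL error message complains about.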
PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe would it work properly if it was only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. Bytes in that range may not appear anywhere else in the sequence (the remaining bytes are all in the range \x80-\xBF), so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
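The Shift-JIS backslash problem can be reproduced in a few lines. This sketch assumes the mbstring extension with Shift-JIS support; the character 表 (U+8868) is an illustrative choice whose second Shift-JIS byte happens to be \x5C.

```php
<?php
// U+8868 '表' written as explicit UTF-8 bytes.
$utf8 = "\xE8\xA1\xA8";
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// In Shift-JIS the second byte of this character is 0x5C, the ASCII backslash.
$sjisHasBackslashByte = strpos($sjis, "\x5C") !== false;

// In UTF-8 every byte of a multi-byte character is >= 0x80, so there is no such collision.
$utf8HasBackslashByte = strpos($utf8, "\x5C") !== false;

// A byte-wise replacement of backslashes therefore corrupts the Shift-JIS string...
$sjisMangled = str_replace('\\', '/', $sjis);

// ...but leaves the UTF-8 string untouched.
$utf8Untouched = str_replace('\\', '/', $utf8);

var_dump($sjisHasBackslashByte, $utf8HasBackslashByte);
var_dump($sjisMangled === $sjis, $utf8Untouched === $utf8);
```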
It's correct because UTF-8 multibyte characters are exclusively non-ASCII (128+ byte value) characters beginning with a byte that defines how many bytes follow, so you can't accidentally end up matching a part of one UTF-8 multibyte character with another.
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (a bytes in ASCII range), since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be safe that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
Well, I do have a counter-example: I have a UTF-8 encoded ".ini" settings file specifying application settings like the email sender name. It says something like:
email_from = Märta
and I read it from there to variable $sender. Now that I replace the message body (UTF8 again)
regards
{sender}
$message = str_replace("{sender}", $sender, $message);
The email is absolutely correct in every respect but the sender is totally broken. There are other cases (like explode() ) when something goes wrong with a UTF string. It is healthy before the conversion but not after it. Sorry to say there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file so the problem may well lie in that very function so the str_replace() may well be innocent.
No you cannot.
Speaking from practice: if some of your symbols are multibyte, like ◊, and others are not, it won't work correctly, because there are symbols that take 2-4 bytes to store.
str_replace takes fixed bytes and replaces them. As a result we get something that isn't any valid symbol, just trash.
Yes, I think this is correct, at least I couldn't find any counter-example.