Accessing chars in multibyte string php


I have mbstring.func_overload = 7 set and I am using UTF-8. Everything works fine except this:
$str = "ãçéíõ";
echo $str[0];
It prints a question mark in the browser.
This instead works normally:
echo substr($str,0,1);
Does anyone know why?

Indexing into the string with $str[0] pulls bytes out of it. Offset access cannot be made aware of encodings, even when mbstring.func_overload is set. You will need to use substr, even if it is less convenient.
Indexing into a string is a grievous coding error unless that string represents a blob, and you just came upon the reason.

Yes, it's because you are using multibyte strings, in which a single character is represented by one to four bytes. If you select just one byte (as in $str[0]), you probably get only part of a character.
substr(), on the other hand, is multibyte-safe here because mbstring.func_overload maps it to mb_substr(), which counts characters rather than bytes.
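A minimal sketch of the difference, assuming the file is saved as UTF-8 and the mbstring extension is loaded. It calls mb_substr() directly rather than relying on mbstring.func_overload:

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is available.
$str = "ãçéíõ";

// Offset access returns one byte: the lead byte of "ã" (0xC3 0xA3).
$firstByte = $str[0];

// mb_substr() counts characters in the given encoding.
$firstChar = mb_substr($str, 0, 1, 'UTF-8');

echo bin2hex($firstByte), "\n";      // c3
echo $firstChar, "\n";               // ã
echo strlen($str), "\n";             // 10 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n"; // 5 (characters)
```

The lone 0xC3 byte is not valid UTF-8 on its own, which is why the browser shows a question mark.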

Related

how to count the occurrences of a Unicode character in a string?

how do you count the occurrences of a Unicode character in a string with PHP?
Maybe this is a simple question, but I am a beginner in PHP.
I want to count how many times the Unicode character U+06CC occurs in a string.
The letter 'yeh' in Farsi corresponds to two code points:
ی = u+06cc
ي = u+064a
U+064A is used as a substitute in Farsi text.
The popular Arabic charset CP-1256 has no character mapped to U+06CC.
I want to count the occurrences of U+06CC in a string to detect whether the string is Arabic or Farsi.
When I use $count = substr_count($str, "ى"); or
$count = substr_count($str, "\xDB\x8c");
it counts both "ی" and "ي".
Any ideas?

I suppose you have a UTF-8 string, since UTF-8 is the most reasonable Unicode encoding.
$count = substr_count($str, "\xDB\x8C");
is what you want. You simply treat the string as a sequence of bytes. In UTF-8 the first byte of a multibyte character and its continuation bytes can never be mixed up (the first byte is always 11...... in binary, while continuation bytes are always 10......). This ensures you cannot find something different from what you are looking for.
To find the UTF-8 encoding of U+06CC I used the fileformat.info website, which I think is the best for this purpose.
If you use UTF-8 in your IDE too, you can simply type the character itself as a string literal instead of "\xDB\x8C" (internally they are exactly the same bytes in PHP), but that makes the readability of your code dependent on the IDE (often not good if you need to share your code).
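As a small sketch of the byte-level approach: the sample string below is hypothetical and is built from escape sequences, so the encoding of the source file does not matter.

```php
<?php
$farsiYeh  = "\xDB\x8C"; // U+06CC ARABIC LETTER FARSI YEH
$arabicYeh = "\xD9\x8A"; // U+064A ARABIC LETTER YEH

// Hypothetical sample: two Farsi yehs with one Arabic yeh between them.
$str = $farsiYeh . $arabicYeh . $farsiYeh;

$farsiCount  = substr_count($str, $farsiYeh);
$arabicCount = substr_count($str, $arabicYeh);

echo $farsiCount, "\n";   // 2
echo $arabicCount, "\n";  // 1
echo bin2hex($str), "\n"; // db8cd98adb8c
```

Dumping the real input with bin2hex() is a quick way to check which byte sequences it actually contains when the counts look wrong.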
Now that you have clarified your question, my answer above is no longer appropriate. I leave it there as a reference for other passers-by.
Your problem could stem from the fact that, from what I have read, "ي" can lose its dots below if it is modified by the Unicode character U+0654 (the non-spacing mark "Arabic hamza above"). Since my browser does not remove the dots, and adds the hamza, I don't know whether the hamza is supposed to disappear too when the dots disappear. Anyway, it COULD be that "\xDB\x8C" has the same appearance as "\xD9\x8A\xD9\x94". I have not been able to find the reverse, i.e., the double dot below as a non-spacing modifier character, which would explain why substr_count($str, "\xDB\x8c") finds the Arabic yeh too, but maybe it exists.

I have tried this example, and it works fine:
$str="مىمى";
$count = substr_count($str, "ى");
echo $count;
I got the answer 2, which is correct.
If you want a more specific answer, you should provide more specific details in your question.

Substr not working with html tags and entities

I have gone through the following question:
substr() not working
but it did not work for me :(
I am facing the same problem. I am using nicEditor, and at insert time I do htmlentities(addslashes(urlencode($description))).
When I view the description it displays correctly, but when I use substr() it returns nothing, like:
substr($description,0,10)
$description contains the content and it is fine; it is present in the DB and works without substr().

Please provide a var_dump() of $description and a bit more of the code that runs before $description is filled in, so we can see whether there is another problem.
Try this:
Use mb_substr for multibyte character encodings like UTF-8. substr
just counts bytes while mb_substr counts characters.
substr() works with singlebyte only
http://php.net/manual/en/function.mb-substr.php
Source: PHP Substr Function Trimming Problem
This happens because in UTF-8 characters are not restricted to one
byte, they have variable length to match Unicode characters, between 1
and 4 bytes.
A safe way of cutting these strings without losing anything is by
using the mb_substr PHP function instead. It works almost the same way
as substr but the difference is that you can add a new parameter to
specify the encoding type, whether is UTF-8 or a different encoding.
Source: http://osc.co.cr/extracting-a-substring-from-a-utf-8-string-in-php/

substr doesn't work fine with utf8

I am using substr to get the first 20 characters of a string. It works fine in normal situations, but with RTL languages (UTF-8) it gives me wrong results (only about 10 characters are shown). I have searched the web but found nothing useful to solve this issue. This is my line of code:
substr($article['CBody'],0,20);
Thanks in advance.

If you’re working with strings encoded as UTF-8 you may lose
characters when you try to get a part of them using the PHP substr
function. This happens because in UTF-8 characters are not restricted
to one byte, they have variable length to match Unicode characters,
between 1 and 4 bytes.
You can use mb_substr(). It works almost the same way as substr, but you can add a parameter to specify the encoding, whether it is UTF-8 or something else.
Try this:
$str = mb_substr($article['CBody'], 0, 20, 'UTF-8');
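echo $str; // output the UTF-8 string as-is; utf8_decode() would convert it to ISO-8859-1, which cannot represent RTL scripts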
Hope this helps.
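To illustrate why roughly 10 characters come back instead of 20, here is a sketch with a made-up sample string. Arabic letters take two bytes each in UTF-8, so a 20-byte cut holds only 10 of them:

```php
<?php
// Hypothetical sample text: 30 copies of the Arabic letter meem (U+0645, bytes D9 85).
$body = str_repeat("\xD9\x85", 30);

$byBytes = substr($body, 0, 20);             // first 20 bytes
$byChars = mb_substr($body, 0, 20, 'UTF-8'); // first 20 characters

echo mb_strlen($byBytes, 'UTF-8'), "\n"; // 10
echo mb_strlen($byChars, 'UTF-8'), "\n"; // 20
echo strlen($byChars), "\n";             // 40
```

With mixed text the byte cut can also land in the middle of a character, which leaves an invalid sequence at the end.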

Use mb_substr() instead. It handles multibyte characters.
http://php.net/manual/en/function.mb-substr.php

Why use multibyte string functions in PHP?

At the moment, I don't understand why it is so important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?

None of the standard PHP string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.

Here is my answer in plain English.
A single Japanese, Chinese or Korean character takes more than one byte. For example, a typical character such as x takes 1 byte in English text, while a character in Japanese, Chinese or Korean takes more than 1 byte. PHP's standard string functions treat each byte as one character, so comparing or measuring Japanese, Chinese or Korean strings with them will not work as expected. For example, "Hello World!" written in Japanese, Chinese or Korean will occupy more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php

You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
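A small sketch of both points made in this answer, assuming the file is saved as UTF-8 and PHP 7.0 or later (for the "\u{...}" escapes). The sample strings are made up for illustration:

```php
<?php
// Assumes this file is saved as UTF-8.
$text = "Grüße aus Köln";

// Byte-oriented search and replace: the needle "ö" (C3 B6) can only match
// where a real "ö" starts, never in the middle of another character.
$replaced = str_replace("ö", "oe", $text);
$pos = strpos($text, "aus"); // a byte offset, not a character offset

echo $replaced, "\n"; // Grüße aus Koeln
echo $pos, "\n";      // 8

// The caveat about decomposed characters: both strings display as "é",
// yet a code-point count differs.
$composed   = "\u{00E9}";  // é as a single code point
$decomposed = "e\u{0301}"; // e followed by COMBINING ACUTE ACCENT

echo mb_strlen($composed, 'UTF-8'), "\n";   // 1
echo mb_strlen($decomposed, 'UTF-8'), "\n"; // 2
```

Note that the offset from strpos() is still measured in bytes, so it should only be passed to other byte-oriented functions.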

PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
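The same effect can be sketched with real UTF-8 data. The subject below is three copies of "ä" written as escapes; without the /u modifier preg_match() works on bytes, with it the pattern and subject are treated as UTF-8:

```php
<?php
// Three copies of "ä" (U+00E4, bytes C3 A4), independent of the file encoding.
$subject = str_repeat("\xC3\xA4", 3);

// Byte mode: "." matches one byte. The bytes are C3 A4 C3 A4 C3 A4,
// so there are never three identical bytes in a row.
$byteMode = preg_match('/(.)\1\1/', $subject);

// UTF-8 mode: "." matches one character, so "äää" is three in a row.
$utf8Mode = preg_match('/(.)\1\1/u', $subject);

echo $byteMode, "\n"; // 0
echo $utf8Mode, "\n"; // 1
```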
Summary:
No sane 'default' is possible, as PHP strings do not carry their character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as useful for stating the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.

multibyte => multi + byte.
1) It is used to work with strings written in languages other than English.
2) The default PHP string functions only work properly with English (or closely related) text.
3) If you want strlen(), strpos(), strtoupper() or str_replace() to behave correctly on special characters, you need multibyte-aware handling.
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते) or Gujarati (હેલો).
Different languages can have their own character sets, so mbstring was introduced to work with text in various languages (Chinese, Japanese, etc.).

Raul González is a perfect example of why.
Consider shortening over-long user names for a MySQL database: say we have a 10-character limit and the name Raul González.
The unit test below shows how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
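public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    // substr() cuts after 10 bytes, splitting the two-byte "á" and leaving a lone \xC3.
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
        // Expected: the database rejects the invalid UTF-8 string.
    }
    $this->assertTrue(isset($ex));

    // mb_substr() cuts after 10 characters, so the string stays valid UTF-8.
    $name2 = mb_substr($name, 0, 10, 'UTF-8');
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}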
Laravel and PHPUnit were used for illustration.
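The same effect can be reproduced without a framework or a database. This is only a sketch: it uses mb_check_encoding() as a stand-in for the check that MySQL performs, and it writes the name with escapes so the file encoding does not matter:

```php
<?php
// 'Raul González' in UTF-8; "á" is the two bytes C3 A1.
$name = "Raul Gonz\xC3\xA1lez";

$cutBytes = substr($name, 0, 10);             // "Raul Gonz" plus a lone 0xC3
$cutChars = mb_substr($name, 0, 10, 'UTF-8'); // "Raul Gonzá"

var_dump(mb_check_encoding($cutBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($cutChars, 'UTF-8')); // bool(true)
echo bin2hex(substr($cutBytes, -1)), "\n";       // c3
```

The trailing c3 is the same byte that shows up as '\xC3' in the error message above.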

How to get the exact number of multibyte characters?

I tried:
mb_strlen('普通话');
strlen('普通话');
Both of them output 9, while in fact there are only 3 characters.
What's the right way to count characters?

You should make sure to specify the encoding in the second parameter, i.e.:
mb_strlen('普通话', 'UTF-8');
See the manual.

If you don't have access to the mbstring extension, this also works (and I believe it's faster), although utf8_decode() is deprecated as of PHP 8.2:
strlen(utf8_decode('普通话')); // 3

One Chinese character is not the same size as one ASCII character.
mb_strlen is the right way to count multibyte characters if the string is UTF-8 encoded.
see here:
http://www.herongyang.com/PHP-Chinese/Multibyte-UTF-8-mb_strlen.html
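A short sketch comparing the counts. One plausible explanation for the 9 in the question (an assumption, since the configuration is not shown) is that mb_strlen() fell back to a single-byte internal encoding because no encoding was passed. The string is written as escapes so the file encoding does not matter:

```php
<?php
// The bytes of '普通话' in UTF-8 (three characters, three bytes each).
$s = "\xE6\x99\xAE\xE9\x80\x9A\xE8\xAF\x9D";

$bytes      = strlen($s);                  // counts bytes
$chars      = mb_strlen($s, 'UTF-8');      // counts UTF-8 characters
$singleByte = mb_strlen($s, 'ISO-8859-1'); // every byte is treated as a character
$viaRegex   = preg_match_all('/./us', $s); // alternative without mbstring

echo $bytes, "\n";      // 9
echo $chars, "\n";      // 3
echo $singleByte, "\n"; // 9
echo $viaRegex, "\n";   // 3
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding() happens to be set to.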
