How to get the exact number of multibyte characters? - php

I tried:
mb_strlen('普通话');
strlen('普通话');
both of them output 9,while in fact there are only 3 characters.
What's the right way to count characters?

you should make sure to specify the encoding in the second parameter
ie
mb_strlen('普通话', 'UTF-8');
see the manual

If you don't have access to the mb string extension this also works (and I believe it's faster):
strlen(utf8_decode('普通话')); // 3

One Chinese character doesn't equal to one ascii character.
mb_strlen is the right way to count multi-byte characters if the string in UTF-8 encoded.
see here:
http://www.herongyang.com/PHP-Chinese/Multibyte-UTF-8-mb_strlen.html

Related

PHP html_entity_decode is not working for UTF-8 characters? [duplicate]

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1�2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.
how about using mb_strlen() ?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen, its possible to configure your webserver by setting mbstring.func_overload directive to 2, so it will automatically replace using of strlen to mb_strlen in your scripts.
The string you posted is six character long: $1�2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two)
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six byte long sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit 2). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "�".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation; by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1�2).
need to use Multibyte String Function mb_strlen() like:
mb_strlen($string, 'UTF-8');
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence � is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes that reads the file in latin1, and shows $1�2.
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software(eg, the web browser) is using to intepret the string. It could use some uncommon encoding scheme where a character can be represented by less than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be utf8, but the browser could decide to interpret it as, say, some 5 bit character set.
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.

PHP strpos says different croatian chars are the same: š č

I have the following code:
$text = 'Tomáš'
echo strpos($text, "č");
# result if 4
I believe they are different chars so why is PHP telling me they are the same?
What is going on and how can I correct this?
The encoding you chose to save your source code file in cannot encode the characters you're trying to save. Whatever characters PHP is seeing, it's not comparing the strings you think it is. Save your source code in an encoding that can encode all characters, preferably UTF-8.
You should try with mb_strpos function.
Performs a multi-byte safe strpos() operation based on number of characters. The first character's position is 0, the second character position is 1, and so on.
With a regular setup, it returns false to me.
However if you've troubles with such special characters, using mb_strpos instead of strpos should help.
http://php.net/manual/en/function.mb-strpos.php

how to count the occurrences of a Unicode character in a string?

how do you count the occurrences of a Unicode character in a string with PHP?
maybe this is a simple questions but I am a biginner in PHP.
I want to count how many Unicode characters U+06cc are in a string.
Character 'yeh' in farsi corresponds to 2 code points.
ی = u+06cc
ي = u+064a
that u+064a is a substitute in Farsi.
The popular character Arabic charset CP-1256 has no character mapped into U+06cc.
now I want to count how many Unicode characters U+06cc are in a string to detect that string is arabic or farsi.
when I use $count = substr_count($str, "ى"); or when I use
$count = substr_count($str, "\xDB\x8c");
it counts both "ی" and "ي" ,
any idea ?
I suppose you have a UTF-8 string, since UTF-8 is the most reasonable Unicode encoding.
$count = substr_count($str, "\xDB\x8C");
is what you want. You simply treat the string as a sequence of bytes. In UTF-8 the first byte of a multibyte character and its continuation bytes can never be mixed up (the first byte is always 11...... binary, while continuation bytes are always 10......). This ensures you cannot find something different from what your are looking for.
To find the UTF-8 encoding of U+06CC I used the fileformat.info website, which I think is the best for this purpose.
If you use UTF-8 in your IDE too, you can simply write "ى" instead of "\xDB\x8C" (internally they are exactly the same string in PHP), but that will make the readability of what you have written dependent on the IDE (often not good if you need to share your code).
Now that you have clarified your question, my above answer is no more appropriate. I leave it there just as a reference for other passers-by.
Your problem could stem from the fact that, reading here it seems that "ي" can lose its dots below if modified by the Unicode character U+0654 (the non-spacing mark "Arabic hamsa above"). Since my browser does not remove the dots, and adds the hamsa, I don't know whether the hamsa is supposed to disappear too when the dots disappear. Anyway, it COULD be that "\xDB\x8C" has the same appearance as "\xD9\x8A\xD9\x94". I have not been able to find the reverse, i.e., the double dot below as a non-spacing modification character, which would explain why substr_count($str, "\xDB\x8c") finds the Arabic yeh too - but maybe it exists.
I have tried this example, and it works fine:
$str="مىمى";
$count = substr_count($str, "ى");
echo $count;
I got the answer 2 , which is true.
If you want a more specific answer, you should provide more specific details in your question.

substr doesn't work fine with utf8

I am using a substr method to access the first 20 characters of a string. It works fine in normal situation, but while working on rtl languages (utf8) it gives me wrong results (about 10 characters are shown). I have searched the web but found nth useful to solve this issue. This is my line of code:
substr($article['CBody'],0,20);
Thanks in advance.
If you’re working with strings encoded as UTF-8 you may lose
characters when you try to get a part of them using the PHP substr
function. This happens because in UTF-8 characters are not restricted
to one byte, they have variable length to match Unicode characters,
between 1 and 4 bytes.
You can use mb_substr(), It works almost the same way as substr but the difference is that you can add a new parameter to specify the encoding type, whether is UTF-8 or a different encoding.
Try this:
$str = mb_substr($article['CBody'], 0, 20, 'UTF-8');
echo utf8_decode($str);
Hope this helps.
Use this instead, here is extra text to make the body long enough. This will handle multi-byte characters.
http://php.net/manual/en/function.mb-substr.php

Accessing chars in multibyte string php

I have mbstring.func_overload = 7 and using UTF-8. Everything works fine but this not:
$str = "ãçéíõ";
echo $str[0];
It prints a question mark in the browser.
This instead works normally:
echo substr($str,0,1);
Someone knows why?
Indexing into the string with $str[0] pulls bytes out of it. It cannot be made aware of encodings, no matter that mbstring.func_overload has been set so. You will need to use substr even if it is not as convenient.
Indexing into a string is a grievous coding error unless that string represents a blob, and you just came upon the reason.
Yes, it's because you are using multibyte strings, in which a single character is represented by one to four bytes. If you select just one byte (as in $str[0]) you probably have only a half character selected.
substr() instead is multibyte save and doesn't count the bytes, but the chars.

Categories