strtolower() for unicode/multibyte strings - php

I have some text in a non-English/foreign language in my page,
but when I try to make it lowercase, it characters are converted into black diamonds containing question marks.
$a = "Երկիր Ավելացնել";
echo $b = strtolower($a);
//returns ����� ���������
I've set my charset in a metatag, but this didn't fix it.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
What can I do to convert my string to lowercase without corrupting it?

Have you tried using mb_strtolower()?

PHP5 is not UTF-8 compatible, so you still need to resort to the mb extension. I suggest you set the internal encoding of mb to utf-8 and then you can freely use its functions without specifying the charset all the time:
mb_internal_encoding('UTF-8');
...
$b = mb_strtolower($a);
echo $b;

i have found this solution from here
$string = 'Թ';
echo 'Uppercase: '.mb_convert_case($string, MB_CASE_UPPER, "UTF-8").'';
echo 'Lowercase: '.mb_convert_case($string, MB_CASE_LOWER, "UTF-8").'';
echo 'Original: '.$string.'';
works for me (lower case)

Have you tried mb_strtolower() and specifying the encoding as the second parameter?
The examples on that page appear to work.
You could also try:
$str = mb_strtolower($str, mb_detect_encoding($str));

Php by default does not know about utf-8. It assumes any string is ASCII, so it strtolower converts bytes containing codes of uppercase letters A-Z to codes of lowercase a-z. As the UTF-8 non-ascii letters are written with two or more bytes, the strtolower converts each byte separately, and if the byte happens to contain code equal to letters A-Z, it is converted. In the result the sequence is broken, and it no longer represents correct character.
To change this you need to configure the mbstring extension:
http://www.php.net/manual/en/book.mbstring.php
to replace strtolower with mb_strtolower or use mb_strtolower direclty. I any case, you need to spend some time to configure the mbstring settings to match your requirements.

Use mb_strtolower instead, as strtolower doesn't work on multi-byte characters.

strtolower() will perform the conversion in the currently selected locale only.
I would try mb_convert_case(). Make sure you explicitly specify an encoding.

You will need to set the locale; see the first example at http://ca3.php.net/manual/en/function.strtolower.php

Related

how to get unicode character from a unicode string in php

I want to get a single unicode chatacter from a unicode string.
for example:-
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
output is:- �
but i want to get char 'प' at 0 index of the string.
plz help me how to get char 'प' instead of � .
As #deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);
PHP's default $str[x] notation operates on bytes, so you're just getting the first part of a multibyte character. To extract entire encoding aware byte sequences for whole characters, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Manipulating Thai Characters in PHP

I'm struggling getting Thai characters and PHP working together. This is what I'd like to do:
<?php
mb_internal_encoding('UTF-8');
$string = "ทาง";
echo $string[0];
?>
But instead of giving me the first character of $string (ท), I just get some messed up output. However, displaying $string itself works fine.
File itself is of course UTF-8 as well. Content-Type in Header is also set to UTF-8. I changed the neccessary lines in php.ini according to this site.
utf8_encoding() and utf8_decoding() also don't help. Maybe any of you has an idea?
In PHP When you access a string with $string[0] it doesn't return the fist character, but the first byte.
You should use mb_substr instead. For example:
mb_substr($string, 0, 1, 'UTF-8');
Note: Since you are using mb_internal_encoding('UTF-8'); you may as well ignore the last parameter.
This happens because PHP is not aware of the encoding a string is in (that is: the encoding is not stored in the string object). So it will treat it as ANSI/ASCII by default. If you don't want that, then you must use the Multibyte String Function (mb_*).
When you set mb_internal_encoding('UTF-8'); you are telling it to use UTF-8 for all the Multibyte String Function, but not for anything else.

substr doesn't work fine with utf8

I am using a substr method to access the first 20 characters of a string. It works fine in normal situation, but while working on rtl languages (utf8) it gives me wrong results (about 10 characters are shown). I have searched the web but found nth useful to solve this issue. This is my line of code:
substr($article['CBody'],0,20);
Thanks in advance.
If you’re working with strings encoded as UTF-8 you may lose
characters when you try to get a part of them using the PHP substr
function. This happens because in UTF-8 characters are not restricted
to one byte, they have variable length to match Unicode characters,
between 1 and 4 bytes.
You can use mb_substr(), It works almost the same way as substr but the difference is that you can add a new parameter to specify the encoding type, whether is UTF-8 or a different encoding.
Try this:
$str = mb_substr($article['CBody'], 0, 20, 'UTF-8');
echo utf8_decode($str);
Hope this helps.
Use this instead, here is extra text to make the body long enough. This will handle multi-byte characters.
http://php.net/manual/en/function.mb-substr.php

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
**EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.

How to get the exact number of multibyte characters?

I tried:
mb_strlen('普通话');
strlen('普通话');
both of them output 9,while in fact there are only 3 characters.
What's the right way to count characters?
you should make sure to specify the encoding in the second parameter
ie
mb_strlen('普通话', 'UTF-8');
see the manual
If you don't have access to the mb string extension this also works (and I believe it's faster):
strlen(utf8_decode('普通话')); // 3
One Chinese character doesn't equal to one ascii character.
mb_strlen is the right way to count multi-byte characters if the string in UTF-8 encoded.
see here:
http://www.herongyang.com/PHP-Chinese/Multibyte-UTF-8-mb_strlen.html

Categories