mbstring extension provides enhanced support for Simplified Chinese,
Traditional Chinese, Korean, and Russian in addition to Japanese.
I tried displaying a Japanese character (which I copied from www.google.co.jp) on my PHP page and it displayed fine. Do I need to use mbstring when I'm displaying UTF-32 characters?
EDIT:
<?php
echo "भ";
$s = strlen("भ");
echo $s;
?>
How do I make the second line of code work?
PS: I have changed PHP default charset to UTF-8.
You need the mb_ string functions in place of the regular string functions, e.g. mb_substr instead of substr. If you don't use the regular string functions, there's no use for the mb_ functions either.
If you're just passing text through and PHP isn't doing anything with that text, there's no need for the mb_ functions.
To make the mb_ functions work correctly, you'll have to tell them what encoding your text is in. They support many different encodings; unless you tell them which one you're using, their results may be incorrect. You can pass that encoding to each mb_ function call, e.g. mb_strlen($str, 'UTF-8'), or you can set it once for all mb_ functions using mb_internal_encoding('UTF-8').
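As a minimal sketch of the two options, applied to the character from the question: the example assumes PHP 7 or later (for the "\u{...}" escape) and that the mbstring extension is loaded. The variable names are only illustrative.

```php
<?php
// "\u{092D}" is the Devanagari letter भ from the question.
$s = "\u{092D}";

// strlen() counts bytes: this character takes three bytes in UTF-8.
$bytes = strlen($s);

// Option 1: pass the encoding on each call.
$chars = mb_strlen($s, 'UTF-8');

// Option 2: set the encoding once, then omit the argument.
mb_internal_encoding('UTF-8');
$charsDefault = mb_strlen($s);

echo "$bytes $chars $charsDefault\n"; // 3 1 1
```

Passing the encoding explicitly keeps a call independent of global state, which may matter in library code that cannot control mb_internal_encoding.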
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for a comprehensive introduction.
Related
Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?
If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?
Strings in PHP are just byte sequences; they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding. It tries to make an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first one in which the sequence matches. These functions are very basic and don't even exist for many popular encodings.
There is no way, with or without the mbstring extension, to ascertain the encoding of a string - only to maybe rule some out, which you could only do if the string happens to contain byte sequences that would be invalid in those particular encodings.
You will never know whether "\xC2\xA4" is supposed to be the UTF-8 ¤ or the ISO-8859-1 Â¤ just by looking at it, because they're the exact same bytes.
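The "rule some out" part can be sketched without mbstring. This is one possible approach, not something the answer above prescribes: it relies on PCRE's /u modifier, which validates the subject as UTF-8 before matching. The helper name is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: true if $bytes is a well-formed UTF-8 sequence.
// With the /u modifier, preg_match() returns false on malformed input
// instead of 1, so the empty pattern acts as a pure validity check.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("\xC2\xA4")); // true: valid UTF-8, but equally valid ISO-8859-1
var_dump(is_valid_utf8("\xA4"));     // false: a lone continuation byte rules UTF-8 out
```

A true result only means the bytes could be UTF-8. It does not prove that they were meant to be.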
For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
There's always iconv, which is generally enabled in PHP by default.
<pre>
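<?php
// Note: changing these iconv settings at runtime is deprecated as of PHP 5.6.
// Set the encoding iconv assumes for internal strings.
iconv_set_encoding("internal_encoding", "UTF-8");
// Set the encoding iconv uses for output.
iconv_set_encoding("output_encoding", "ISO-8859-1");
// Show the current input, output and internal encodings.
var_dump(iconv_get_encoding('all'));
?>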
</pre>
In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
This is to ensure UTF-8 support. I have read that, with these settings in place, one should also use the multibyte string functions throughout. I am currently altering a library that parses an Excel file, and I need to split an attribute value of the form N12 to determine the spreadsheet size. I know for a fact that this value cannot contain characters outside the ASCII range. Do I need to use the multibyte string functions to parse the 12 out of N12, or can I use the normal ones? I am asking because I would like to keep the solution general and maybe submit it back to the library. If I need to pick the function depending on whether the current mode is UTF-8 or not, what is the best way to check for this?
UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they by definition can also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for an example, see the question "Multibyte trim in PHP?".
So it depends on what exactly you're trying to do. Possibly core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when using multi-byte strings, then you can use the appropriate MB function instead which by definition will also handle ASCII just fine when treating the input as UTF-8.
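For the N12 case from the question, a sketch with core functions only might look like this. The helper name split_cell_ref and the letters-then-digits pattern are assumptions, since the library's actual code is not shown here.

```php
<?php
// Hypothetical helper: split a cell reference such as "N12" into its
// column letters and row number. The input is assumed to be pure ASCII.
function split_cell_ref(string $ref): ?array
{
    if (preg_match('/^([A-Z]+)([0-9]+)\z/', $ref, $m) !== 1) {
        return null; // not in the expected letters-then-digits form
    }
    return [$m[1], (int) $m[2]];
}

[$column, $row] = split_cell_ref('N12');
echo "$column $row\n"; // N 12
```

Because every byte of the input is ASCII, byte offsets and character offsets coincide, so the result is the same whether or not the surrounding application treats its strings as UTF-8.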
I was wondering what Phalcon uses internally with regard to UTF-8.
For instance if I use something like this
echo strlen('hello'); // output 5
However
echo strlen('汉字/漢字'); // outputs 13 in UTF-8: four 3-byte characters plus the 1-byte slash
strlen is not UTF8 compatible so one has to use the mb_strlen to be safe.
Does Phalcon use (internally) mb_* related functions? If not how can we ensure that everything internally is handled in a UTF8 manner to ensure compatibility with all languages?
Thanks!
Currently, PHP is binary safe, which means you can work with multibyte strings (UTF-8 or other charsets), Latin-1 or ASCII in a transparent way.
Phalcon only uses strlen when working with directory names (not sure if anyone is using directories with multi-byte characters).
I've been searching for UTF-8-safe alternatives to the string manipulation functions. I've found many different opinions and suggestions. I would like to ask whether the following functions can cause problems with UTF-8 and, if they do, what I should use instead. I know the list of mb_-prefixed functions in the PHP manual, but it does not cover all the functions I am using.
Functions are: implode, explode, str_replace, preg_match, preg_replace
Thank you
explode just looks for an identical byte sequence and separates the string at that point. Since UTF-8 is safely backwards compatible with ASCII, there's no concern and it will work fine. implode just assembles strings together, which works fine as well due to the properties of UTF-8. str_replace works for the same reasons. The preg_ functions work fine as long as you are using the /u modifier.
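The behaviour described above can be checked with a short sketch. It assumes the source file itself is saved as UTF-8; the sample strings are arbitrary.

```php
<?php
$csv = 'alpha,日本語,ふーばー';

// explode() splits on the exact byte sequence of the delimiter.
$parts = explode(',', $csv);

// implode() concatenates the byte strings unchanged.
$joined = implode(',', $parts);

// str_replace() matches whole byte sequences, so a multi-byte needle is fine.
$replaced = str_replace('本', 'hon', '日本語');

// Without /u the dot matches one byte; with /u it matches one code point.
$withoutU = preg_match_all('/./', '日本語');
$withU    = preg_match_all('/./u', '日本語');

echo count($parts), " $replaced $withoutU $withU\n"; // 3 日hon語 9 3
```

The last two lines show why the /u modifier matters: without it the pattern sees nine separate bytes instead of three characters.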
If you need to safely manipulate UTF-8 characters, you can do it like this:
mb_internal_encoding('UTF-8');
preg_replace( '`...`u', '...', $string ) // with the u (unicode) modifier
Looking at all those string functions, I sometimes get confused. Some people use the mb_ functions all the time, others the plain ones, so the question is simple...
When should I use mb_strpos(); and when should I go with the plain one (strpos();)?
And yes, I'm aware that the mb_ functions stand for multi-byte, but does it really mean that if I'm working only with UTF-8 encoded strings, I should stick with the mb_ functions?
Thanks in advance!
You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo') // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー') // won't work as expected, use mb_strpos instead
Yes, if working with UTF-8 (which is a multi-byte encoding: one character can use more than one byte), you should use the mb_* functions.
The non-mb functions work on bytes, not characters, which is fine when 1 character == 1 byte; but that's not the case with (for example) UTF-8.
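To illustrate the difference between bytes and characters, here is a small sketch comparing the two position functions. It assumes the source file is saved as UTF-8 and that mbstring is loaded.

```php
<?php
$haystack = 'ふーばー';
$needle   = 'ばー';

// strpos() reports a byte offset: the two preceding characters take 3 bytes each.
$bytePos = strpos($haystack, $needle);

// mb_strpos() reports a character offset.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');

echo "$bytePos $charPos\n"; // 6 2
```

Both results are correct in their own unit. Problems appear when a byte offset is passed to a function that expects a character offset, or the other way round.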
I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mbstring extension is loaded, you should check first, because mbstring is a non-default extension.
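Such a check could be sketched as follows. The helper name utf8_length is hypothetical, and the PCRE fallback is just one possible substitute, not something the documentation quoted above recommends.

```php
<?php
// Hypothetical helper: count the characters of a UTF-8 string, preferring
// mbstring and falling back to PCRE when the extension is not loaded.
function utf8_length(string $s): int
{
    if (extension_loaded('mbstring')) {
        return mb_strlen($s, 'UTF-8');
    }
    // Fallback: /u makes "." match one code point, /s lets it match newlines.
    // preg_match_all() returns false on malformed UTF-8; that case is
    // reported as 0 here for brevity.
    return (int) preg_match_all('/./us', $s);
}

echo utf8_length("\u{092D}"), "\n"; // 1
```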