Huh, looking at all those string functions, sometimes I get confused. One is using all the time mb_ functions, the other - plain ones, so the question is simple...
When should I use mb_strpos(); and when should I go with the plain one (strpos();)?
And, yes, I'm aware about that mb_ functions stand for multi-byte, but does it really mean, that if I'm working with only utf-8 encoded strings, I should stick with mb_ functions?
Thanks in advance!
You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo') // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー') // won't work as expected, use mb_strpos instead
Yes, if working with UTF-8 (which is a multi-byte encoding : one character can use more than one byte), you should use the mb_* functions.
The non-mb functions will work on bytes, and not characters -- which is fine when 1 character == 1 byte ; but that's not the case with (for example) UTF-8.
I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mb extension is loaded, you should check before because mb-string is a non-default extension.
Related
Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?
If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?
Strings in PHP are just byte sequences, they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding, it tries to make an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first one in which the sequence matches. These functions are very basic and don't even exist for many popular encodings.
There is no way, with or without the mbstring extension, to ascertain the encoding of a string - only to maybe rule some out, which you could only do if the string happens to contain byte sequences that would be invalid in those particular encodings.
You will never know whether "\xC2\xA4" is supposed to be the UTF-8 ¤ or ISO-8859-1 ¤ just by looking at it - because they're the exact same bytes.
For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
There's always iconv, which is generally enabled in PHP by default
<pre>
<?php
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
var_dump(iconv_get_encoding('all'));
?>
</pre>
In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
To ensure UTF8 support. I have read that one should also use the multibyte string manipulation functions throughout if you have set these settings. I am currently altering a library which parses an excel file, and I need to split the one attribute value in the form N12 to determine the spreadsheet size. I know for a fact that the value cannot have values outside of ascii range. Do I need to use the multibyte string manipulation functions to parse the 12 out of N12 or can I use the normal ones. I am asking as I would like to keep the solution general and maybe submit the solution back to the library. If I need to use the correct function depending on whether current mode is utf8 or not, what is the best way to check for this?
UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they by definition can also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for example: Multibyte trim in PHP?.
So it depends on what exactly you're trying to do. Possibly core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when using multi-byte strings, then you can use the appropriate MB function instead which by definition will also handle ASCII just fine when treating the input as UTF-8.
mbstring extension provides enhanced support for Simplified Chinese,
Traditional Chinese, Korean, and Russian in addition to Japanese.
I tried displaying a Japanese character (which I copied from www.google.co.jp) on my PHP page and it displayed fine. Do I need to use mbstring when I'm displaying UTF-32 characters?
EDIT:
<?php
echo "भ";
$s = strlen("भ");
echo $s;
?>
How do I make the second line of code to work?
PS: I have changed PHP default charset to UTF-8.
You need the mb_ string functions in place of the regular string functions, e.g. mb_substr instead of substr. If you don't use the regular string functions, there's no use for the mb_ functions either.
If you're just passing text through and PHP isn't doing anything with that text, there's no need for the mb_ functions.
To make the mb_ functions work correctly, you'll have to tell them what encoding your text is in. They support many different encodings, without telling them which you're using their results will be incorrect. You can pass that encoding to each mb_ function call, e.g. mb_strlen($str, 'UTF-8'), or you can set it once for all mb_ functions using mb_internal_encoding('UTF-8').
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for a comprehensive introduction.
I was wondering what is the internals of Phalcon using with regards to UTF8?
For instance if I use something like this
echo strlen('hello'); // output 5
However
echo strlen('汉字/漢字'); // will output something like 10
strlen is not UTF8 compatible so one has to use the mb_strlen to be safe.
Does Phalcon use (internally) mb_* related functions? If not how can we ensure that everything internally is handled in a UTF8 manner to ensure compatibility with all languages?
Thanks!
Currently, PHP is binary safe,that means you can work with multibyte strings (like utf8 or other charsets), latin1 or ascii in a transparent way.
Phalcon, only uses strlen when working with directory names (not sure if anyone is using directories with multi-byte characters).
I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works ok. But I have concerns about multibyte characters. Should I specifically handle them to prevent undetermined errors, or should this regex reject mb filenames ok?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extensionDocs (which may or may not be available). Furthermore, it would appear that there is no way to create a "multibyte-character string", as such -- rather, one chooses to treat a native string as multibyte-character string by using special "mbstring" functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
You may be able to get away with it so long as you use UTF-8 (or similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes (for instance, ß is encoded as 0xcf 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might potentially encode a multibyte character into "special" PHP bytes, such as 0x22, the "double-quote" symbol.
The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character-sets are mb_eregDocs, mb_eregiDocs, mb_ereg_replaceDocs and mb_eregi_replaceDocs.
PCRE based regular expression functions like preg_matchDocs support UTF-8 by using the u-modifier (PCRE8)Docs.
But of course, as described above PHP strings don't know their own encoding, so you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that that function specifies the encoding of the string you're matching, not the string containing the regular expression itself.