Is Phalcon UTF8 compliant? - php

I was wondering what Phalcon uses internally with regard to UTF-8.
For instance, if I use something like this:
echo strlen('hello'); // outputs 5
However:
echo strlen('汉字/漢字'); // outputs 13 for UTF-8 input (the byte count, not the 5 characters)
strlen is not UTF-8 aware, so one has to use mb_strlen to be safe.
Does Phalcon use mb_* functions internally? If not, how can we ensure that everything is handled internally in a UTF-8-safe manner, for compatibility with all languages?
Thanks!

Currently, PHP strings are binary safe, which means you can work with multibyte strings (UTF-8 or other charsets), Latin-1, or ASCII transparently.
Phalcon only uses strlen when working with directory names (not sure whether anyone uses directories with multibyte characters).
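To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative and nothing here reflects Phalcon's internals.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$text = '汉字/漢字';

// strlen counts bytes: four CJK characters at 3 bytes each, plus 1 byte for "/".
$byteCount = strlen($text);

// mb_strlen counts characters when told the string is UTF-8.
$charCount = mb_strlen($text, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, ' characters', PHP_EOL;
```

Being binary safe means neither call corrupts the string; they simply answer different questions.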

Related

How to convert a Chinese character to UTF-16 code units?

I'm using PHP for this web development project. Right now, I'm working on a user page where the user can add words that he knows. Of course, I'm starting out crude, without adding any special features yet, such as a "Do you know this character?" suggestion.
I have tackled the challenges of adding UTF-16 collation and charset set to UTF-16 in my MySQL Database, in fact online at http://freemysqlhosting.net to support Chinese characters in my website. Now what I'm struggling with is to support automatic PinYin generation for my Chinese characters.
I have found this after searching all over SO: https://github.com/reorx/pinyindep/blob/master/Uni2Pinyin. Each line begins with a Chinese character, in UTF-16 Code Units.
Take for example, 爱. In UTF-16, it is 7231. I convert this at https://r12a.github.io/apps/conversion/. When I do a lookup in the file, I get the pinyin associated. :D This is the functionality I need, though looking it up in GitHub is in JS, rather than PHP.
In the manual lookup, ai4 is returned, which is the correct intonation. Now, what I'm looking for is either a built-in PHP library or a code snippet to convert a string input, let's say “爱”, into its four-character UTF-16 code unit, here 7231.
So what's the question:
How should I convert a Chinese character, in form of a string, to UTF-16 code units? (Either through built-in library, or through a suggested PHP Code Snippet)
P.S. I don't really like third-party tools unless they are really popular worldwide, or there's no other option.
You need to use PHP's multibyte string module:
$c = "爱";
list(, $d) = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
echo dechex($d);
// => 7231
Change UTF-8 to UTF-16 if your string is coming from the database in that encoding.
mb_convert_encoding converts the string into a four-byte-per-character encoding; unpack then reads the four bytes as an unsigned long; finally, dechex converts that number to a hexadecimal string.
If you are using PHP 7.2+ you can use mb_ord to simplify the conversion.
echo dechex(mb_ord("爱"));
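Both snippets above return the Unicode code point, which equals the single UTF-16 code unit only for characters in the Basic Multilingual Plane (most common Chinese characters are). If you need the actual code units, including surrogate pairs for characters beyond U+FFFF, something along these lines may work. This is a sketch, not part of the original answer: the function name utf16_code_units is hypothetical, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: returns the UTF-16 code units of a UTF-8 string
// as 4-digit lowercase hex strings.
function utf16_code_units(string $s): array
{
    if ($s === '') {
        return [];
    }
    // UTF-16BE stores each code unit as two bytes, most significant byte first.
    $bytes = mb_convert_encoding($s, 'UTF-16BE', 'UTF-8');
    // bin2hex yields two hex digits per byte, so four per code unit.
    return str_split(bin2hex($bytes), 4);
}

$bmp    = utf16_code_units("爱");         // one code unit
$astral = utf16_code_units("\u{1F600}");  // a surrogate pair (two code units)

echo implode(' ', $bmp), PHP_EOL;
echo implode(' ', $astral), PHP_EOL;
```

For a lookup file keyed by four-digit hex values, the single-unit case is the one that matters; the surrogate-pair case is shown only so the difference is visible.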

PHP: parsing ascii string safely when running in multibyte mode

In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
This is to ensure UTF-8 support. I have read that, once these settings are in place, one should also use the multibyte string manipulation functions throughout. I am currently altering a library that parses an Excel file, and I need to split an attribute value of the form N12 to determine the spreadsheet size. I know for a fact that the value cannot contain anything outside the ASCII range. Do I need to use the multibyte string manipulation functions to parse the 12 out of N12, or can I use the normal ones? I am asking because I would like to keep the solution general and maybe submit it back to the library. If I need to pick the correct function depending on whether the current mode is UTF-8 or not, what is the best way to check for this?
UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they can by definition also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for an example, see "Multibyte trim in PHP?".
So it depends on what exactly you're trying to do. Possibly core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when using multi-byte strings, then you can use the appropriate MB function instead which by definition will also handle ASCII just fine when treating the input as UTF-8.
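For the N12 case specifically, a byte-oriented parse is enough, since every character involved is a single ASCII byte. A minimal sketch follows; the function name split_cell_ref and the returned array shape are hypothetical and not taken from the library in question.

```php
<?php
// Hypothetical helper: splits an ASCII cell reference such as "N12"
// into its column letters and row number. Returns null on anything else.
function split_cell_ref(string $ref): ?array
{
    // No /u modifier is needed: the pattern only matches ASCII bytes,
    // and any byte of a multibyte UTF-8 sequence is >= 0x80, so it cannot match.
    if (preg_match('/\A([A-Z]+)([0-9]+)\z/', $ref, $m) !== 1) {
        return null;
    }
    return ['column' => $m[1], 'row' => (int) $m[2]];
}

$cell = split_cell_ref('N12');
echo $cell['column'], ' / ', $cell['row'], PHP_EOL;
```

Because the pattern rejects non-ASCII input outright, the result does not depend on whether the rest of the application treats strings as UTF-8.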

When do I need to enable mbstring in PHP?

mbstring extension provides enhanced support for Simplified Chinese,
Traditional Chinese, Korean, and Russian in addition to Japanese.
I tried displaying a Japanese character (which I copied from www.google.co.jp) on my PHP page and it displayed fine. Do I need to use mbstring when I'm displaying UTF-32 characters?
EDIT:
<?php
echo "भ";
$s = strlen("भ");
echo $s;
?>
How do I make the second line of code work?
PS: I have changed PHP default charset to UTF-8.
You need the mb_ string functions in place of the regular string functions, e.g. mb_substr instead of substr. If you don't use the regular string functions, there's no use for the mb_ functions either.
If you're just passing text through and PHP isn't doing anything with that text, there's no need for the mb_ functions.
To make the mb_ functions work correctly, you'll have to tell them what encoding your text is in. They support many different encodings, and if you don't tell them which one you're using, their results will be incorrect. You can pass that encoding to each mb_ function call, e.g. mb_strlen($str, 'UTF-8'), or you can set it once for all mb_ functions using mb_internal_encoding('UTF-8').
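A small sketch of why the encoding argument matters, using the character from the question. It assumes the file is saved as UTF-8 and mbstring is loaded; the variable names are illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8, so "भ" is stored as 3 bytes.
$s = "भ";

// Correct encoding: one character.
$asUtf8 = mb_strlen($s, 'UTF-8');

// Wrong encoding: every byte is treated as its own character.
$asLatin1 = mb_strlen($s, 'ISO-8859-1');

// Setting the encoding once makes the argument optional afterwards.
mb_internal_encoding('UTF-8');
$withDefault = mb_strlen($s);

echo $asUtf8, ' ', $asLatin1, ' ', $withDefault, PHP_EOL;
```

Passing the encoding explicitly on each call is more verbose but does not depend on global state that other code may change.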
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for a comprehensive introduction.

let htmlspecialchars use UTF-8 as default charset?

Is there a way to tell PHP to use UTF-8 as default for functions like htmlspecialchars ?
I have already set this:
ini_set('mbstring.internal_encoding','UTF-8');
ini_set('mbstring.func_overload',7);
If not, please can you post a list of all functions where I need to specify the charset?
(I need this because I am refactoring my whole framework to work with UTF-8)
Just use htmlspecialchars() instead of htmlentities(). Because it doesn't touch the non-ASCII characters, it doesn't matter whether you use 'utf8' charset or the default 'latin1'(*), the results are the same. As a bonus your output is smaller. (Though it does mean you have to ensure you're actually serving your page with the correct encoding.)
(*: there are a few East Asian multibyte charsets which can differ in their use of ASCII code points, so if you're using those you would still need to pass a $charset argument to htmlspecialchars(). But certainly no such problem for UTF-8.)
Is there a way to tell PHP to use UTF-8 as default for functions like htmlspecialchars ?
Nope, not as far as I know. mbstring.internal_encoding will define a default encoding for the mb_* family of functions only.
If not, please can you post a list of all functions where I need to specify the charset?
I'm not sure whether such a list exists - if in doubt, just walk through the manual and look out for any charset parameters.
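One way to avoid depending on the default at all is to pass the charset explicitly through a thin wrapper and call that everywhere. The sketch below is an assumption-laden illustration: the function name e is hypothetical, and the choice of ENT_QUOTES is mine, not something from the answers above. As far as I know, PHP 5.4 and later already default to UTF-8 for htmlspecialchars, so the explicit argument mainly matters on older versions or when default_charset has been changed.

```php
<?php
// Hypothetical wrapper: escapes a string for HTML output, always as UTF-8.
function e(string $value): string
{
    // ENT_QUOTES also escapes single quotes, which is safer inside attributes.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$escaped = e('<a href="x">漢字 & café</a>');
echo $escaped, PHP_EOL;
```

Non-ASCII characters pass through untouched; only the HTML-special ASCII characters are replaced.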

When should I use mb_strpos(); over strpos();?

Huh, looking at all those string functions, sometimes I get confused. One is using all the time mb_ functions, the other - plain ones, so the question is simple...
When should I use mb_strpos(); and when should I go with the plain one (strpos();)?
And yes, I'm aware that mb_ stands for multi-byte, but does it really mean that if I'm working only with UTF-8 encoded strings, I should stick with the mb_ functions?
Thanks in advance!
You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo') // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー') // won't work as expected, use mb_strpos instead
Yes, if working with UTF-8 (which is a multi-byte encoding : one character can use more than one byte), you should use the mb_* functions.
The non-mb functions will work on bytes, and not characters -- which is fine when 1 character == 1 byte ; but that's not the case with (for example) UTF-8.
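To illustrate the bytes-versus-characters point, here is a short sketch where the needle is not at the start of the string, so the two offsets differ. It assumes a UTF-8 source file and the mbstring extension; the variable names are illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 (each kana below is 3 bytes).
$haystack = 'ふーばー';
$needle   = 'ばー';

// strpos returns a byte offset: two 3-byte characters precede the needle.
$byteOffset = strpos($haystack, $needle);

// mb_strpos returns a character offset.
$charOffset = mb_strpos($haystack, $needle, 0, 'UTF-8');

// Each offset is only valid with the matching family of functions.
$viaBytes = substr($haystack, $byteOffset);
$viaChars = mb_substr($haystack, $charOffset, null, 'UTF-8');

echo $byteOffset, ' ', $charOffset, PHP_EOL;
```

Mixing the two families, for example feeding a strpos result to mb_substr, is where things typically go wrong.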
I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mb extension is loaded, you should check before because mb-string is a non-default extension.
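Such a check could look roughly like this. It is a sketch only: the function name utf8_length is hypothetical, and the fallback assumes PCRE was built with UTF-8 support and that the input is meant to be UTF-8.

```php
<?php
// Hypothetical helper: counts characters in a UTF-8 string, using mbstring
// when it is loaded and a regex-based fallback otherwise.
function utf8_length(string $s): int
{
    if (extension_loaded('mbstring')) {
        return mb_strlen($s, 'UTF-8');
    }
    // With the /u modifier, "." matches one code point; /s lets it match newlines.
    $count = preg_match_all('/./us', $s);
    // preg_match_all returns false on invalid UTF-8; fall back to the byte count.
    return $count === false ? strlen($s) : $count;
}

$len = utf8_length("भ");
echo $len, PHP_EOL;
```

Checking function_exists('mb_strlen') instead of extension_loaded('mbstring') would work similarly.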
