Some years ago, I built a custom PHP CMS site, but I overlooked one important issue: Unicode support. This was primarily because, at the time, the users were English-speaking and that was expected to remain the case for the foreseeable future. Another factor was PHP's poor Unicode support to begin with.
Well, now the day of reckoning has come. I want to support Unicode, specifically UTF-8, but I have one major obstacle: PHP's string functions. Correct me if I'm wrong, but even now, in the world of PHP 5.5, PHP's regular string functions (e.g. strlen, substr, str_replace, strpos) do not fully support Unicode. On the other hand, PHP's mb_string functions do support Unicode, but I have read that they may be rather resource-heavy (which makes sense, since they deal with multibyte characters rather than single-byte characters).
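To illustrate the sort of difference I mean (a throwaway example, not from my actual codebase):

    <?php
    $word = "héllo";                             // 6 bytes, 5 characters in UTF-8

    var_dump(strlen($word));                     // int(6): counts bytes
    var_dump(mb_strlen($word, 'UTF-8'));         // int(5): counts characters

    var_dump(strpos($word, 'l'));                // int(3): byte offset
    var_dump(mb_strpos($word, 'l', 0, 'UTF-8')); // int(2): character offset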
So, the way I see it, there are three solutions:
1) Use multibyte string functions in all cases.
A. Try to override the standard string functions with their multibyte counterparts. Speaking of which, were I to do this, what is the best way to do so?
B. Painstakingly go through all my code and replace the standard string functions with their multibyte function counterparts.
2) Painstakingly go through all my code and replace only the standard string functions that touch user input, database data, etc., with their multibyte counterparts. This would require me to look carefully at every usage of every string function in my code to determine whether it has even the slightest chance of dealing with multibyte characters.
The benefit of this is that I would have the optimal running time while at the same time fully supporting unicode. The drawback here is that this would be very time-consuming (and extremely boring, I might add) to implement and there would always be the chance I'd miss using a multibyte string function where I should.
3) Overhaul my software entirely and start from scratch. But this is something I'm trying to avoid.
If there are other options available, please let me know.
I'd go for a variation of 1.B:
1.B.2) Use an automated "search and replace" (a single carefully crafted sed command might do it; see the sketch below).
Reason for choosing 1 over 2: premature optimization is the root of all evil. I don't know where you read that the mb_ functions are "resource heavy", but frankly, that's nonsense. Of course they take a few more CPU cycles, but that is a dimension you really should not worry about. For some reason PHP developers love to debate micro-optimizations like "are single quotes faster than double quotes?" when they should focus on the things that really make a difference (mostly I/O and the database). Really, it's not worth any effort.
Reason for automation: it's possible, it's more efficient, do you need more arguments?
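To make 1.B.2 concrete, here is a rough sketch in PHP of what such an automated pass might look like. The function list, the src/ directory and the exclusion pattern are my assumptions; functions without an mb_ counterpart (such as str_replace) have to be left alone, and you should review the resulting diff in version control before trusting it:

    <?php
    // Hypothetical bulk rewrite: map a handful of standard string functions
    // to their mb_* counterparts across a source tree.
    $functions = ['strlen', 'strpos', 'stripos', 'substr', 'strtolower', 'strtoupper'];

    // Skip calls that are already prefixed (mb_strlen), method/static calls
    // (->strlen, ::strlen) and variables ($strlen).
    $pattern = '/(?<![\w>:$])(' . implode('|', $functions) . ')\s*\(/';

    $files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('src/'));
    foreach ($files as $file) {
        if ($file->getExtension() !== 'php') {
            continue;
        }
        $code = file_get_contents($file->getPathname());
        $code = preg_replace($pattern, 'mb_$1(', $code);
        file_put_contents($file->getPathname(), $code);
    }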
So I've posted several questions related to updating existing PHP software to support Unicode / UTF-8. One of the proposed solutions is to override PHP's default string functions with PHP's mb_string functions. However, I see a lot of people talking about negative consequences, yet no one really elaborates on them. Can someone please explain what these negative consequences are?
Why is it "bad" to override PHP's default string functions with its mb_string functions? It's after all much simpler than replacing all those functions with their corresponding mb_ functions manually. So what am I missing? What are these negative consequences?
It's bad to override them because if another developer comes along and works on this code, the functions might not behave the way they expect. It's always best to use the default functions as they were intended.
I think the mb_* family of functions is heavier, as they also perform Unicode checks even on simple ASCII strings. At scale this will slow your application down (perhaps not significantly, but definitely somewhat).
I'll try to elaborate.
Overloading the standard string functions with mb_* will have dire consequences for anything reading or dealing with binary files, or binary data in general. If you overload the standard functions, then suddenly strlen($binData) is bound to return the wrong length at some point.
Why?
Imagine the binary data contains a byte with a value in the range 0xC0-0xDF, 0xE0-0xEF or 0xF0-0xF7. Those are UTF-8 lead bytes, and the overloaded strlen will now count that byte together with the bytes that follow it as a single character, rather than the 2, 3, or 4 bytes they actually are.
And the main problem is that mbstring.func_overload is global. It doesn't just affect your own script, but all scripts, and any frameworks or libraries they may use.
When asked "should I enable mbstring.func_overload?", the answer is always, and SHOULD always be, a resounding NO.
You are royally screwed if you use it, and you will spend countless hours hunting bugs. Bugs that may very well be unfixable.
Well, you CAN call mb_strlen($string, 'latin1') to get it to behave, but it still carries overhead. strlen uses the fact that PHP strings are like Java strings: they know their own length. mb_strlen parses the string to count the bytes.
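A minimal sketch of the failure mode described in this answer (the byte values are just an illustration):

    <?php
    // Three raw bytes of "binary" data; 0xC3 happens to be a UTF-8 lead byte.
    $binData = "\x41\xC3\xA9";

    echo strlen($binData);               // 3: plain byte count, what binary code expects

    // With mbstring.func_overload=2 and UTF-8 as internal encoding, strlen()
    // effectively becomes this, and the count silently changes:
    echo mb_strlen($binData, 'UTF-8');   // 2: \xC3\xA9 is counted as one character

    // The escape hatch mentioned above (or '8bit' for an explicit byte count):
    echo mb_strlen($binData, '8bit');    // 3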
I have read several things about this topic but still I have doubts I want to share with the community.
I want to add complete UTF-8 support to the application I developed, DaDaBIK; the application can be used with different DBMSs (such as MySQL, PostgreSQL, SQLite). The charset used in the databases can be ANY. I can't set or assume the charset.
My approach would be to convert, using the iconv functions, everything I read from the DB to UTF-8 and then convert it back to the original charset when I have to write to the DB. This would allow me to assume I'm always working with UTF-8.
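Roughly, the flow I have in mind would be something like this (the charset and the sample value are just placeholders):

    <?php
    // Sketch of the read/write conversion; $dbCharset would come from the
    // application configuration for the specific database.
    $dbCharset      = 'ISO-8859-1';
    $rawValueFromDb = "Caf\xE9";                        // "Café" as stored in the DB

    // Reading: convert whatever comes out of the DB into UTF-8.
    $utf8Value = iconv($dbCharset, 'UTF-8//TRANSLIT', $rawValueFromDb);

    // Writing: convert back to the original charset before the INSERT/UPDATE.
    $dbValue = iconv('UTF-8', $dbCharset . '//TRANSLIT', $utf8Value);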
The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even assuming I use mbstring, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can create problems with UTF-8 and DON'T have an mbstring counterpart, for example the PREG extension, strcspn, trim, ucfirst, ucwords...
Since I'm using some external libraries such as ADOdb and htmLawed, I can't control all the source code... in those libraries there are several uses of those functions... do you have any advice? And above all, how are very popular applications like WordPress handling this (IMHO big) problem? I doubt they don't have any "trim" in their code... do they just take the risk (data corruption, for example), or is there something I'm not seeing?
Thanks a lot.
First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.
It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
preg_ supports UTF-8 just fine using the /u modifier.
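A small sketch of that distinction (the strings are just examples):

    <?php
    $s = "  héllo wörld  ";

    // Byte-oriented functions that only look for ASCII bytes are safe on UTF-8:
    var_dump(trim($s));                                  // "héllo wörld"
    var_dump(str_replace('wörld', 'there', trim($s)));   // "héllo there"

    // Offset-based functions are the ones that need the mb_ variants:
    var_dump(substr(trim($s), 0, 2));                    // "h" plus half of the "é"
    var_dump(mb_substr(trim($s), 0, 2, 'UTF-8'));        // "hé"

    // preg_* handles UTF-8 once the /u modifier is added:
    preg_match('/^.{2}/u', 'wörld', $m);
    var_dump($m[0]);                                     // "wö": two characters, not two bytes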
If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for some more in-depth discussion and demystification about PHP and character sets.
Further, it does not matter what encoding the strings in the database use. You can set the connection encoding for the database, which will cause it to convert everything for you and always return data in the desired client encoding. There is no need to run everything through iconv in PHP.
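For example, with MySQL the connection encoding can be set like this (the DSN, credentials and the utf8 choice are placeholders):

    <?php
    // Let the driver convert between the storage charset and UTF-8 for you.
    $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'secret');

    // Or, with mysqli:
    $mysqli = new mysqli('localhost', 'user', 'secret', 'app');
    $mysqli->set_charset('utf8');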
I have an application that has so far been in English only. Content encoding throughout templates and database has been UTF-8. I am now looking to internationalize/translate the application into languages that have character sets absolutely needing UTF-8.
The application uses various PHP string functions such as strlen(), strpos(), substr(), etc., and my understanding is that I should switch these to the multibyte string functions such as mb_strlen(), mb_strpos(), mb_substr(), etc., in order for multibyte characters to be handled correctly. I've tried to read around this topic a little, but virtually everything I can find goes deep into "encoding theory" and doesn't provide a simple answer to the question: if I'm using UTF-8 throughout, can I switch from strlen() to mb_strlen() and expect things to work normally in, for example, both English and Arabic, or is there something else I still need to look out for?
Any insight would be welcome, and apologies if I'm offending someone who has encoding close to their heart with my relative ignorance.
No. Since byte arrays are also strings in PHP, a simple replacement of the 8-bit string functions with their mb_* counterparts will cause nothing but trouble. Functions like strlen() and substr() are probably used more frequently with bytes than with actual text strings.
At the place I last worked, we managed to build a multilingual web-site (Arabic, Hindi, among other languages) in PHP without using the mbstring library at all. Text string manipulation actually doesn't happen that often. When it does, it would require far more care than just changing a function name. Most of the challenges, I've found, lie on the HTML side. Getting a page layout to work with a RTL language is the non-trivial part.
I don't know if you're just using Arabic as an example. The difficulty of internationalization can vary quite substantially depending on whether "international" means European languages only (plus Russian), or if it's inclusive of Middle-Eastern, South-Asian, and Far-East languages.
Check the status of the mbstring.func_overload flag in php.ini
If (ini_get('mbstring.func_overload') & 2) is true, then functions like strlen() (as listed here) are already overloaded by mb_strlen(), so there is no need for you to call the mb_* functions explicitly.
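In other words, a quick check could look like this (a sketch; bit 2 of the flag covers the string functions):

    <?php
    if (ini_get('mbstring.func_overload') & 2) {
        // strlen(), strpos(), substr(), ... are transparently mapped to mb_* already.
        echo "String functions are overloaded by mbstring\n";
    } else {
        // The standard byte-oriented functions are in effect; call mb_* explicitly.
        echo "String functions are NOT overloaded\n";
    }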
The number of multibyte functions you really need is under 10, so create three to five smaller questions asking whether a particular function usage or piece of logic is sound. This question is vague and hard to answer. Small questions get quick answers; concrete questions bring out good answers. Let me know when you create the other questions.
If you need use cases, see the fallback functions in CMSes such as WordPress, MediaWiki, and Drupal.
If you decide to start using mbstring, you should avoid the mbstring.func_overload directive. The mbstring maintainers are planning to deprecate mbstring.func_overload in PHP 5.5 or 5.6 (see the PHP core mailing list, April 2012). mbstring.func_overload breaks codebases that do not expect it to be enabled; you can see such cases in CakePHP and Zend Framework 1.x, when calculating Content-Length using strlen().
I answered a similar question in another place: Should I refactor all my framework to use mbstring functions?
I currently use mbstring.func_overload = 7 to get things working with the UTF-8 charset.
I am thinking of refactoring all function calls to use the mb_* functions.
Do you think this is necessary, or will the multibyte problem be solved in another way in PHP 6 or a newer version?
Not recommended if you are using libraries other people have created. Here are three reasons:
1. Overloading can break the behavior of libraries that don't expect overloading.
2. Your framework can break in environments where overloading is not enabled.
3. Depending on overloading shrinks the pool of prospective users of your framework, because of 2.
A good example of 1 is the miscalculation of the byte size for the HTTP Content-Length header when using strlen(). The cause is that the overloaded strlen() does not return the number of bytes but the number of characters. You can see real-world issues in CakePHP and Zend_Http_Client.
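A small sketch of that Content-Length pitfall (the body string is illustrative):

    <?php
    $body = "héllo";                       // 6 bytes, but only 5 characters in UTF-8

    // With mbstring.func_overload active, strlen($body) returns 5 (characters),
    // so this line would announce too few bytes and truncate the response:
    // header('Content-Length: ' . strlen($body));

    // Counting in the '8bit' encoding always yields the byte length (6),
    // with or without overloading:
    header('Content-Length: ' . mb_strlen($body, '8bit'));
    echo $body;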
Edit:
Deprecating mbstring.func_overload is under consideration for PHP 5.5 or 5.6 (per the mbstring maintainer's mail from April 2012), so you should avoid mbstring.func_overload now.
The recommended policy for handling multibyte characters across platforms is to use mbstring, intl or iconv directly. If you really need fallback functions for handling multibyte characters, use function_exists().
You can see the cases in Wordpress and MediaWiki.
WordPress: wp-includes/compat.php
MediaWiki: Fallback Class
Some CMSes, like Drupal (unicode.inc), introduce a multibyte abstraction layer.
I don't think such an abstraction layer is a good idea.
The reason is that the number of multibyte handling functions needed is, in most cases, under 10; the multibyte functions are easy to use directly; and switching the handling between mbstring, intl and iconv depending on which modules are installed costs performance.
The minimum requirement for handling multibyte characters is mb_substr() plus handling of invalid byte sequences.
You can see the cases of a fallback function for mb_substr() in the above CMSes.
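As a rough sketch of what such a function_exists() fallback looks like (this is my own illustration, not code copied from WordPress, MediaWiki or Drupal):

    <?php
    if (!function_exists('mb_substr')) {
        // Minimal character-based substr; the $encoding argument is accepted
        // for signature compatibility, but UTF-8 input is assumed.
        function mb_substr($str, $start, $length = null, $encoding = 'UTF-8')
        {
            // Split into whole UTF-8 characters with a /u regex, then slice.
            preg_match_all('/./us', $str, $chars);
            return implode('', array_slice($chars[0], $start, $length));
        }
    }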
I answered a question about handling invalid byte sequences in the following place: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems
for strings that are UTF-8 (of course)
Yes, of course. There are many things you can do with strings though. UTF-8 is backwards compatible with ASCII. If you only want to operate on the ASCII characters of a string, it may or may not make a difference. It depends on what you need to do with your strings.
If you want a direct answer: No, you should not refactor every function to an mb_ function, because it's likely overkill. Should you check your use cases whether a multi-byte UTF-8 string may impact results and refactor accordingly? Yes.
That is, if I'm coding something entirely in PHP4? Or perhaps I should use a custom function or class/extension instead to count the number of characters in a multibyte string?
The only difference I can spot is that mb_strlen strips out bad sequences, while iconv_strlen doesn't.
If you want a drop-in replacement for plain strlen, use mb_strlen, as it always returns an int. This is quite debatable (iconv's correctness versus mb's tolerance), but in practice mb_strlen's fault tolerance has served me better. Just make sure you configure mb to the desired encoding, either in php.ini or in a central place in your application.
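For instance, configuring it once and then using mb_strlen as the drop-in (a sketch, with an illustrative string):

    <?php
    // Set the encoding centrally so mb_* calls don't need it repeated everywhere.
    mb_internal_encoding('UTF-8');

    $s = "héllo";
    var_dump(strlen($s));                 // int(6): bytes
    var_dump(mb_strlen($s));              // int(5): characters, using the configured encoding
    var_dump(iconv_strlen($s, 'UTF-8'));  // int(5), but it can fail on invalid byte sequences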
Unicode support in PHP is in a bad place, you have to be aware of many pitfalls and exceptions. Having done a complete switch of several large applications and their user data to UTF-8, I could cry you a river.