That is, if I'm coding something entirely in PHP4? Or perhaps I should use a custom function or class/extension instead to count the number of characters in a multibyte string?
Only difference I can spot is that mb_string strips out bad sequences, while iconv_strlen doesn't.
If you want a drop-in replacement for plain strlen, use mb_strlen, as it always returns an int. Whether iconv's strictness is preferable to mb's tolerance is debatable, but in practice mb_strlen's fault tolerance has served me better. Just make sure you configure mbstring to the desired encoding, either in php.ini or in a central place in your application.
Unicode support in PHP is in a bad place, you have to be aware of many pitfalls and exceptions. Having done a complete switch of several large applications and their user data to UTF-8, I could cry you a river.
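To make the difference concrete, here is a minimal sketch (assuming ext/mbstring and ext/iconv are loaded) comparing the three length functions on a valid UTF-8 string and on one containing a broken byte sequence:

```php
<?php
$valid = "h\xC3\xA9llo"; // "héllo": 5 characters, 6 bytes in UTF-8

echo strlen($valid), "\n";                // 6 - counts bytes
echo mb_strlen($valid, 'UTF-8'), "\n";    // 5 - counts characters
echo iconv_strlen($valid, 'UTF-8'), "\n"; // 5 - counts characters

// With a broken sequence the two extensions diverge:
$invalid = "h\xC3llo"; // 0xC3 is a lead byte with no continuation byte
var_dump(mb_strlen($invalid, 'UTF-8'));    // an int - mb tolerates the bad byte
var_dump(@iconv_strlen($invalid, 'UTF-8')); // false - iconv rejects the string
```

The exact value mb_strlen reports for the invalid string depends on how it resynchronizes after the bad byte; the point is that it returns an int while iconv_strlen bails out.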
So I've posted several questions related to making already existing software written in PHP to be updated to support unicode / utf8. One of the solutions is to override PHP's default string functions with PHP's mb_string functions. However, I see a lot of people talking about negative consequences, yet no one really elaborates on them. Can someone please explain what these negative consequences are?
Why is it "bad" to override PHP's default string functions with its mb_string functions? It's after all much simpler than replacing all those functions with their corresponding mb_ functions manually. So what am I missing? What are these negative consequences?
It's bad to override them because if another developer comes to work on this code, the functions might do something they weren't expecting. It's always better to use the default functions as they were intended.
I think the mb_* family of functions is heavier, as they also perform Unicode checks even on a simple ASCII string. So at scale they will slow your application down (perhaps not significantly, but definitely somewhat).
I'll try to elaborate.
Overloading the standard string functions with mb_* will have dire consequences for anything reading and dealing with binary files, or binary data in general. If you overload the standard function, then suddenly strlen($binData) is bound to return the wrong length at some point.
Why?
Imagine the binary data contains a byte with a value in the range 0xC0-0xDF, 0xE0-0xEF or 0xF0-0xF7. Those are UTF-8 lead bytes, and the overloaded strlen will now count each such sequence as a single character rather than the 2, 3, or 4 bytes it actually occupies.
And the main problem is that mbstring.func_overload is global. It doesn't just affect your own script, but all scripts, and any frameworks or libraries they may use.
When asked "should I enable mbstring.func_overload?", the answer is always, and SHOULD always be, a resounding NO.
You are royally screwed if you use it, and you will spend countless hours hunting bugs. Bugs that may very well be unfixable.
Well, you CAN call mb_strlen($string, 'latin1') to get it to behave, but it still carries an overhead: strlen uses the fact that PHP strings, like Java strings, store their own length, while mb_strlen has to parse the string to count the characters.
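A short sketch of the binary-data problem described above (you don't need func_overload enabled to see it; calling mb_strlen with a multibyte encoding shows the same effect the overloaded strlen would have):

```php
<?php
// Arbitrary binary bytes; 0xC3 0xA9 happens to look like UTF-8 "é".
$binData = "\x89PNG\xC3\xA9\x00";

echo strlen($binData), "\n";             // 7 - the real byte count
echo mb_strlen($binData, 'UTF-8'), "\n"; // fewer - byte pairs collapse into "characters"

// Even when strlen is overloaded, you can force byte semantics explicitly:
echo mb_strlen($binData, '8bit'), "\n";  // 7 - '8bit' counts raw bytes
```

The '8bit' encoding argument is the escape hatch the mbstring extension provides for getting byte counts, but having to remember it everywhere is exactly the kind of bug hunt the answer warns about.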
Some years ago, I built a good custom PHP CMS Site, but I overlooked one important issue: unicode support. This was primarily due to the fact that at the time, the users were English-speaking and that was to remain the case for the foreseeable future. Another factor was PHP's poor unicode support to begin with.
Well, now the day of reckoning has come. I want there to be support for unicode, specifically UTF8, but I have one major obstacle: PHP's string functions. Correct me if I'm wrong, but even now, in the world of PHP 5.5, PHP's regular string functions (i.e. strlen, substr, str_replace, strpos, etc) do not fully support unicode. On the other hand, PHP's mb_string functions do support unicode, but I have read that they may be rather resource heavy (which makes sense since we would be dealing with multibyte characters as opposed to single byte characters).
So, the way I see it, there are three solutions:
1) Use multibyte string functions in all cases.
A. Try to override the standard string functions with their multibyte counterparts. Speaking of which, were I to do this, what is the best way to do so?
B. Painstakingly go through all my code and replace the standard string functions with their multibyte function counterparts.
2) Painstakingly go through all my code and replace standard string functions that would work with user input, database data, etc with their multibyte function counterparts. This would require me to look at every usage of every string function carefully in my code to determine whether it has even the slightest chance of dealing with multibyte characters.
The benefit of this is that I would have the optimal running time while at the same time fully supporting unicode. The drawback here is that this would be very time-consuming (and extremely boring, I might add) to implement and there would always be the chance I'd miss using a multibyte string function where I should.
3) Overhaul my software entirely and start from scratch. But this is something I'm trying to avoid.
If there are other options available, please let me know.
I'd go for a variation of 1.B:
1.B.2) Use an automated "Search and Replace" (a single carefully crafted sed command might do it).
Reason for 1 over 2: premature optimization is the root of all evil. I don't know where you read that the mb_ functions are "resource heavy", but plainly spoken that's utter nonsense. Of course they take a few more CPU cycles, but that is a dimension you really should not worry about. For some reason PHP developers love to discuss micro-optimizations like "are single quotes faster than double quotes?" when they should focus on the things that really make a difference (mostly I/O and the database). Really, it's not worth any effort.
Reason for automation: it's possible, it's more efficient, do you need more arguments?
I am wondering about something: when I check the length of a string with the strlen function in PHP, it says it's 32. However, when I open the file in Notepad or Notepad++, I can see some empty spaces (the alignment is off compared to the other lines).
Since I am writing a banking application, it is critical that strlen be 100% correct.
Should I keep trusting this output, or should I move to something else like mb_strlen?
What is the best function to trust for the normal ASCII character set?
Whether you want to use strlen() or mb_strlen() depends on your server's setup: whether it has the mb_string extension and overloading enabled or not. Usually strlen() is overloaded with mb_strlen() anyway in those cases. Apart from that, the computed values are 100% correct.
The difference you might see is in how text (especially multibyte strings) is visualized when you open it in a text-displaying application. It is much more likely that you are getting misleading information there.
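A small sketch of the most likely explanation for the "extra spaces": they are real whitespace bytes in the data, which strlen correctly counts even though an editor renders them invisibly (the account-number string here is a made-up example):

```php
<?php
// strlen counts every byte, including whitespace you can't see in an editor.
$line = "ACCT0001  \t"; // two trailing spaces and a tab

echo strlen($line), "\n";       // 11 - bytes, invisible whitespace included
echo strlen(trim($line)), "\n"; // 8  - after stripping the invisible characters

// For pure ASCII data, strlen and mb_strlen agree exactly:
assert(strlen('ACCT0001') === mb_strlen('ACCT0001', 'ASCII'));
```

So for an ASCII-only banking format, strlen is trustworthy; the editor view is the misleading part.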
I currently use mbstring.func_overload = 7 to get working with UTF-8 charset.
I am thinking to refactor all func call to use mb_* functions.
Do you think this is necessary, or will PHP 6 or a newer version solve the multibyte problem in another way?
Not recommended if you are using the libraries other people create. Here are three reasons.
Overloading can break the behavior of libraries that don't expect overloading.
Your framework can be broken in environments without overloading.
Depending on overloading decreases the prospective users of your framework, because of 2.
A good example of 1. is the miscalculation of the byte size in the HTTP Content-Length header when using strlen. The cause is that the overloaded strlen function returns the number of characters, not the number of bytes. You can see real-world issues in CakePHP and Zend_Http_Client.
Edit:
Deprecating mbstring.func_overload is under consideration for PHP 5.5 or 5.6 (per a mail from the mbstring maintainer in April 2012). So you should avoid mbstring.func_overload now.
The recommended policy for handling multibyte characters across platforms is to use mbstring, intl or iconv directly. If you really need fallback functions for handling multibyte characters, use function_exists().
You can see the cases in Wordpress and MediaWiki.
WordPress: wp-includes/compat.php
MediaWiki: Fallback Class
Some CMSes, like Drupal (unicode.inc), introduce a multibyte abstraction layer.
I don't think such an abstraction layer is a good idea.
The reason is that the number of multibyte-handling functions needed is in most cases under 10, the multibyte functions are easy to use directly, and dispatching every call between mbstring, intl or iconv depending on which modules are installed costs performance.
The minimum requirement for handling multibyte characters is mb_substr() and handling invalid byte sequences.
You can see the cases of a fallback function for mb_substr() in the above CMSes.
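A minimal sketch of the function_exists() fallback pattern described above (modelled loosely on what WordPress and MediaWiki do; the regex-based body is a simplified illustration, not their actual implementation):

```php
<?php
if (!function_exists('mb_substr')) {
    // Fall back to a UTF-8-aware regex when ext/mbstring is missing.
    function mb_substr($str, $start, $length = null, $encoding = 'UTF-8')
    {
        // Split the string into individual code points.
        preg_match_all('/./us', $str, $chars);
        // Slice by character index, not byte offset.
        return implode('', array_slice($chars[0], $start, $length));
    }
}

echo mb_substr("h\xC3\xA9llo w\xC3\xB6rld", 6, 5), "\n"; // "wörld"
```

The native function is used when available, so the fallback only ever runs on hosts without mbstring; that is the whole appeal of the pattern over a full abstraction layer.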
I answered about handling invalid byte sequences in the following place: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems
for strings that are UTF-8 (of course)
Yes, of course. There are many things you can do with strings though. UTF-8 is backwards compatible with ASCII. If you only want to operate on the ASCII characters of a string, it may or may not make a difference. It depends on what you need to do with your strings.
If you want a direct answer: No, you should not refactor every function to an mb_ function, because it's likely overkill. Should you check your use cases whether a multi-byte UTF-8 string may impact results and refactor accordingly? Yes.
Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source itself is considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).
That full-Unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):
mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled.
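For reference, a sketch of the configuration this feature relies on. func_overload itself can only be set in php.ini (or .htaccess), not at runtime, so it appears here as a comment; the internal encoding has a runtime equivalent:

```php
<?php
// php.ini:
//   mbstring.func_overload = 7        ; bitmask: 1=mail, 2=str*, 4=ereg* (7 = all)
//   mbstring.internal_encoding = UTF-8

mb_internal_encoding('UTF-8');   // runtime equivalent of the second ini line
echo mb_strlen("h\xC3\xA9llo"), "\n"; // 5 - counted using the internal encoding
```

Note that per the answers elsewhere on this page, setting func_overload = 7 is exactly what is being warned against; the snippet only documents what the quoted feature does.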
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
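The Content-Length pitfall can be sketched in a few lines; the header needs a byte count, and codepoint semantics silently undercount (the header() call is commented out since it needs a web SAPI):

```php
<?php
$body = "h\xC3\xA9llo"; // "héllo": 5 codepoints, 6 bytes in UTF-8

$wrong = mb_strlen($body, 'UTF-8'); // 5 - what an overloaded strlen returns
$right = mb_strlen($body, '8bit');  // 6 - explicit byte semantics
// $right === strlen($body) whenever strlen is not overloaded

// header('Content-Length: ' . $right);
echo $wrong, " ", $right, "\n"; // 5 6
```

A client trusting the undercounted header would truncate the response body, which is precisely the CakePHP/Zend_Http_Client class of bug mentioned earlier on this page.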
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)