PHP utf-8 best practices and risks for distributed web applications

I have read several things about this topic but still I have doubts I want to share with the community.
I want to add complete UTF-8 support to the application I developed, DaDaBIK; the application can be used with different DBMSs (such as MySQL, PostgreSQL, SQLite). The charset used in the databases can be anything; I can't set or assume the charset.
My approach would be to convert everything I read from the DB to UTF-8, using the iconv functions, and then convert it back to the original charset when I have to write to the DB. This would allow me to assume I'm working with UTF-8.
The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even assuming the use of mbstring, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can create problems with UTF-8 and DON'T have an mbstring counterpart, for example the PREG extension, strcspn, trim, ucfirst, ucwords...
Since I'm using some external libraries such as adodb and htmLawed, I can't control all the source code. Those libraries use these functions in several places. Do you have any advice? And above all, how do very popular applications like WordPress handle this (IMHO big) problem? I doubt they don't have any "trim" in their code. Do they just take the risk (data corruption, for example), or is there something I can't see?
Thanks a lot.

First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.
It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
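To make the point about trim and str_replace concrete, here is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" escapes) and the mbstring extension. The sample strings are made up for illustration. The last case is a caveat not mentioned above: a custom character list passed to trim is read byte by byte, so it is not safe with multi-byte characters.

```php
<?php
// "héllo" written with an escape so the example does not depend on the file's own encoding.
$padded = "  h\u{00E9}llo \n";

// Default trim() only strips ASCII whitespace bytes, which never occur
// inside a multi-byte UTF-8 sequence.
$trimmed = trim($padded);

// str_replace() compares whole byte sequences, so replacing one UTF-8
// character with another string is safe.
$ascii = str_replace("\u{00E9}", "e", $trimmed);

// Caveat: a custom character list is treated as a set of single bytes.
// "é" is C3 A9 and "à" is C3 A0, so trimming "é" also eats the C3 lead byte of "à".
$broken = trim("\u{00E0}bc\u{00E9}", "\u{00E9}");

var_dump($trimmed === "h\u{00E9}llo");          // bool(true)
var_dump($ascii === "hello");                   // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)
```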
The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
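A short sketch of the offset problem, assuming the mbstring extension is loaded; the sample word is arbitrary:

```php
<?php
$word = "h\u{00E9}llo"; // "héllo": 5 characters, 6 bytes in UTF-8

$byteCut = substr($word, 0, 2);             // first two bytes: "h" plus half of "é"
$charCut = mb_substr($word, 0, 2, 'UTF-8'); // first two characters: "hé"

var_dump(strlen($word));                        // int(6)
var_dump(mb_strlen($word, 'UTF-8'));            // int(5)
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): the cut landed mid-character
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```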
preg_ supports UTF-8 just fine using the /u modifier.
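For illustration, a small sketch of what the /u modifier changes. The patterns are made up for this example. Note that with /u an invalid UTF-8 subject makes the call fail rather than match.

```php
<?php
$char = "\u{00E9}"; // "é", two bytes in UTF-8

$withoutU = preg_match('/^.$/', $char);  // "." matches one byte, so two bytes do not fit
$withU    = preg_match('/^.$/u', $char); // "." matches one code point

// Splitting on the empty pattern with /u yields code points rather than bytes.
$chars = preg_split('//u', "\u{65E5}\u{672C}\u{8A9E}", -1, PREG_SPLIT_NO_EMPTY);

// With /u, a subject that is not valid UTF-8 makes preg_match() return false.
$invalid = @preg_match('/./u', "\xC3");

var_dump($withoutU, $withU, count($chars), $invalid);
```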
If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for some more in-depth discussion and demystification about PHP and character sets.
Further, it does not matter what the strings are encoded as in the database. You can set the connection encoding for the database, which will cause it to convert everything for you and always return you data in the desired client encoding. No need for iconverting everything in PHP.
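As a hedged sketch of setting the connection encoding, here is how it could look with PDO and MySQL. The helper name, host, database name and credentials are placeholders, not anything from the question. Other drivers have their own equivalent (for example PostgreSQL's client_encoding setting).

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that asks the server to
// talk UTF-8 to this client, whatever charset the tables are stored in.
function mysqlUtf8Dsn(string $host, string $dbName): string
{
    // "charset" is an element of the PDO MySQL DSN; utf8mb4 is MySQL's full UTF-8.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

$dsn = mysqlUtf8Dsn('localhost', 'example_db'); // placeholder host and database name

// With a reachable server and real credentials, the connection would be opened like this:
// $pdo = new PDO($dsn, 'user', 'password');
// From then on the server converts between the table charset and the connection charset.
```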

Related

mbstring - what's the purpose of using mb_language('uni'); mb_internal_encoding('UTF-8'); at the beginning of a PHP script?

Sometimes I see some PHP scripts with the following lines at the beginning:
<?php
mb_language('uni');
mb_internal_encoding('UTF-8');
I know these two functions come from the mbstring PHP extension. But what is the effective purpose of calling these two functions at the beginning of the script?
I read the docs, http://php.net/manual/en/function.mb-internal-encoding.php, a user says:
Especially when writing PHP scripts for use on different servers, it is a very good idea to explicitly set the internal encoding somewhere on top of every document served, e.g.
mb_internal_encoding("UTF-8");
This, in combination with mysql-statement "SET NAMES 'utf8'", will save a lot of debugging trouble.
Also, use the multi-byte string functions instead of the ones you may be used to, e.g. mb_strlen() instead of strlen(), etc.
But how much should I worry about the encoding of: DB connection (I use the UTF-8 charset for my tables and call SET NAMES utf8; as soon as I connect to the database), HTTP request input values and output (especially when dealing with multibyte characters like those in the Japanese language), email sending, regular expression patterns to search text, etc... when writing an i18n PHP application and how do these mb_* functions really help me?
I also read this post PHP string functions vs mbstring functions and as I understand it the user who answers the question says that mb_* functions should be avoided:
a simple replacement of the 8-bit string functions with their mb_* counterparts will cause nothing but trouble.
Thank you for your attention. Insights and clarifications are welcome.

mb_language() sets the language used for encoding e-mail messages. Valid languages are "Japanese", "ja", "English", "en" and "uni" (UTF-8). mb_send_mail() uses this setting to encode e-mail.
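For the other call in the question, mb_internal_encoding(), here is a minimal sketch of its effect, assuming the mbstring extension is loaded. It sets the default encoding for mb_* calls that omit the encoding argument; the sample text is arbitrary.

```php
<?php
mb_language('uni');            // e-mail encoding used by mb_send_mail(): UTF-8
mb_internal_encoding('UTF-8'); // default for mb_* calls without an encoding argument

$text = "\u{65E5}\u{672C}\u{8A9E}"; // "日本語"

$chars = mb_strlen($text); // uses the internal encoding set above
$bytes = strlen($text);    // always counts bytes

// Switching the internal encoding changes what the very same call returns.
mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($text);
mb_internal_encoding('UTF-8');

var_dump($chars, $bytes, $asLatin1); // int(3) int(9) int(9)
```

Passing the encoding explicitly to each mb_* call, as in mb_strlen($text, 'UTF-8'), avoids depending on this global setting at all.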
Copied from http://php.net/manual/en/function.mb-language.php

iconv() Vs. utf8_encode()

When you have a charset other than UTF-8 and need to put the data into JSON format to migrate it to a DB, there are two functions that can be used in PHP: utf8_encode() and iconv(). I would like to know which one has better performance, and when it is convenient to use one or the other.

When you have a charset other than UTF-8
Nope - utf8_encode() is suitable only for converting an ISO-8859-1 string to UTF-8. iconv() supports a vast number of source and target encodings.
Re performance, I have no idea how utf8_encode() works internally and what libraries it uses, but my prediction is there won't be much of a difference - at least not on "normal" amounts of data in the bytes or kilobytes. If in doubt, do a benchmark.
I tend to use iconv() because it's clearer that there is a conversion from character set A to character set B.
Also, iconv() provides more detailed control on what to do when it encounters invalid data. Adding //IGNORE to the target character set will cause it to silently drop invalid characters. This may be helpful in certain situations.
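A small sketch of the difference, assuming the iconv extension is available; the sample bytes are made up. utf8_encode() can only do the first conversion shown here (and is deprecated as of PHP 8.2), while iconv() lets you name the actual source charset. The json_encode() calls tie this back to the JSON use case in the question.

```php
<?php
$latin1 = "caf\xE9"; // "café" in ISO-8859-1
$latin9 = "5 \xA4";  // "5 €" in ISO-8859-15, where 0xA4 is the euro sign

$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $latin1);
$fromLatin9 = iconv('ISO-8859-15', 'UTF-8', $latin9);

// Naming the wrong source charset still "works" but yields the wrong character:
// 0xA4 is the currency sign (U+00A4) in ISO-8859-1, not the euro sign.
$wrongGuess = iconv('ISO-8859-1', 'UTF-8', $latin9);

// json_encode() expects UTF-8: the raw Latin-1 bytes fail, the converted string works.
$jsonRaw       = json_encode($latin1);
$jsonConverted = json_encode($fromLatin1);

var_dump($jsonRaw);       // bool(false)
var_dump($jsonConverted); // string(11) ""caf\u00e9""
```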

I recommend writing your own function.
It will be 2-3 lines long and it will be better than struggling with locale, iconv etc. issues.
For example:
Fix Turkish Charset Issue Html / PHP (iconv?)

Declaration to make PHP script completely Unicode-friendly

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).

That full-Unicode thing was precisely the idea of PHP 6, which was canceled more than a year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):
mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled.

All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
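A minimal sketch of the Content-Length case, assuming the mbstring extension is loaded; the body string is a stand-in for any response payload:

```php
<?php
$body = "h\u{00E9}llo"; // "héllo": 5 characters, 6 bytes on the wire

// Content-Length counts bytes, so the byte-oriented strlen() is the right call here.
$header = 'Content-Length: ' . strlen($body);

// A character count under-reports the size whenever the body has multi-byte characters.
$tooShort = 'Content-Length: ' . mb_strlen($body, 'UTF-8');

var_dump($header);   // string(17) "Content-Length: 6"
var_dump($tooShort); // string(17) "Content-Length: 5"
```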
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)

Change Website Character encoding from iso-8859-1 to UTF-8

About 2 years ago I made the mistake of starting a large website using iso-8859-1. I now am having issues with some characters, especially when sending data to the server using ajax. Because of this, I would like to switch to using UTF-8.
What issues do you see coming from this? I know I would have to search the site to look for characters that need to be changed from a ? to their real characters. But, are there any other risks in doing this? Has anyone done this before?

The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:
Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.
Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively — UTF-16 is next most common — and check that it's configured to use UTF-8 on its output to the web server.
Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?
There are at least three different places you can declare the charset for a web document. Be sure you change them all:
the HTTP Content-Type header
the <meta http-equiv="Content-Type"> tag in your documents' <head>
the <?xml> tag at the top of the document, if using XHTML Strict
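One way to keep those three declarations from drifting apart is to generate them from a single value. This is only a sketch; the helper name and array keys are made up for illustration.

```php
<?php
// Hypothetical helper: returns all three charset declarations for one charset.
function charsetDeclarations(string $charset): array
{
    return [
        'header' => 'Content-Type: text/html; charset=' . $charset,
        'meta'   => '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">',
        'xml'    => '<?xml version="1.0" encoding="' . $charset . '"?>',
    ];
}

$declarations = charsetDeclarations('UTF-8');

// In a web request, the header has to be sent before any output:
// header($declarations['header']);
```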
All this comes from my experiences years ago when I traced some Unicode data through a moderately complex N-tier app, and found conversion chains like:
Latin-1 → UTF-8 → Latin-1 → UTF-8
So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset common with Latin-1.
The biggest reason for those odd conversion chains was due to immature Unicode support in the tooling at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
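If the conversion is scripted in PHP rather than with the command-line utility, it helps to guard against the double conversion described above. The following is a rough sketch with a hypothetical helper, assuming the iconv and mbstring extensions. The validity check is a heuristic: Latin-1 text that happens to form valid UTF-8 byte sequences would be left untouched.

```php
<?php
// Hypothetical helper: convert Latin-1 bytes to UTF-8, but leave input
// that is already valid UTF-8 (including pure ASCII) alone.
function latin1ToUtf8Once(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // converting again would double-encode
    }
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}

$latin1 = "caf\xE9"; // "café" in ISO-8859-1
$utf8   = latin1ToUtf8Once($latin1);
$again  = latin1ToUtf8Once($utf8); // unchanged on the second pass

// What a blind second conversion produces: "é" turns into the two characters "Ã©".
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $utf8);

var_dump($utf8 === $again);         // bool(true)
var_dump($doubleEncoded === $utf8); // bool(false)
```

Applied to files, the same helper could wrap file_get_contents() and file_put_contents() in a loop over the site's text files.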

Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.
Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).
IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:
PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
The most common problem is treating a UTF-8 string as ISO-8859-1 and vice versa - you may need to add documentation to your functions stating what kind of encoding they expect, to make these sorts of errors less likely. A variable naming convention for your strings (to indicate what encoding they use) may also help.
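To illustrate the strstr() point: a byte-wise search on valid UTF-8 finds the right substring, because the bytes of one character never look like the start of another. Only the reported offset differs between byte and character functions. A minimal sketch, assuming the mbstring extension; the strings are arbitrary:

```php
<?php
$haystack = "na\u{00EF}ve caf\u{00E9}"; // "naïve café"
$needle   = "caf\u{00E9}";              // "café"

$found      = strstr($haystack, $needle);                // byte-wise search, correct result
$byteOffset = strpos($haystack, $needle);                // offset counted in bytes
$charOffset = mb_strpos($haystack, $needle, 0, 'UTF-8'); // offset counted in characters

var_dump($found === $needle); // bool(true)
var_dump($byteOffset);        // int(7), because "ï" takes two bytes
var_dump($charOffset);        // int(6)
```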

Is it possible to write a great PHP app which uses Unicode?

My next web application project will make extensive use of Unicode. I usually use PHP and CodeIgniter; however, Unicode is not one of PHP's strong points.
Is there a PHP tool out there that can help me get Unicode working well in PHP?
Or should I take the opportunity to look into alternatives such as Python?

PHP can handle Unicode fine once you make sure to encode and decode on entry and exit. If you are storing in a database, ensure that the language encodings and charset mappings match up between the HTML pages, web server, your editor, and the database.
If the whole application uses UTF-8 everywhere, decoding is not necessary. The only time you need to decode is when you are outputting data in another charset that isn't on the web. When outputting HTML, you can use
htmlentities($var, ENT_QUOTES, 'UTF-8');
to get the correct output. Called without the charset argument, the function will destroy the string in most cases (it defaulted to ISO-8859-1 before PHP 5.4). The same goes for the mail functions.
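A short sketch of why the charset argument matters; the sample string is invented. Telling htmlentities() that UTF-8 bytes are ISO-8859-1 turns each byte of a multi-byte character into its own entity.

```php
<?php
$name = "Zo\u{00EB} & \"friends\""; // Zoë & "friends", encoded as UTF-8

$asUtf8   = htmlentities($name, ENT_QUOTES, 'UTF-8');
$asLatin1 = htmlentities($name, ENT_QUOTES, 'ISO-8859-1'); // wrong charset for these bytes

// htmlspecialchars() only escapes the HTML-significant characters and leaves "ë" as is.
$special = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

var_dump($asUtf8);   // "Zo&euml; &amp; &quot;friends&quot;"
var_dump($asLatin1); // "Zo&Atilde;&laquo; &amp; &quot;friends&quot;"
```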
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet is a very good resource for working in UTF-8

One of the major features of PHP 6 will be tightly integrated Unicode support.
Implementing UTF-8 in PHP 5.
Since PHP strings are byte-oriented, the only practical encoding scheme for Unicode text is UTF-8. The tricks are (taken from PHP Architect Magazine):
Present HTML pages in UTF-8
Convert PHP scripts to UTF-8
Convert the site content, back-end databases and the like to UTF-8
Ensure that no PHP functions corrupt the UTF-8 text
Check out http://www.gravitonic.com/talks/ and the PHP UTF-8 Cheat Sheet.

PHP is mostly unaware of charsets and treats strings as bytestreams. That's not much of a problem really, but you'll have to do a bit of work yourself.
The general rule of thumb is that you should use the same charset everywhere. If you use UTF-8 everywhere, then you're 99% there. Just make sure that you don't mix charsets, because then it gets really complicated. The only thing that won't work correctly with UTF-8 is string manipulation that needs to operate on a character level, e.g. strlen, substr etc. You should use UTF-8-aware versions in place of those. The multibyte-string extension gives you just that.
For a checklist of places where you need to make sure the charset is set correctly, look at:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
For more information, look at:
http://www.phpwact.org/php/i18n/utf-8
