I have a web-based tool that is mainly used in English. I was just asked whether it will support Thai. Since it is mostly text based, what modifications do I need to make to support other languages, Thai specifically?
First, you need to make sure you are using UTF8 encoding from end to end. That means in code, output, database and database connection. For PHP you should set mb_internal_encoding and mb_http_output to UTF-8.
Your output headers should declare UTF-8. For example:
Content-Type:text/html; charset=UTF-8
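A minimal sketch of the PHP side of these settings, assuming the mbstring extension is loaded and that this runs at the top of each request before any output is sent:

```php
<?php
// Run once per request, before anything is echoed.
mb_internal_encoding('UTF-8');   // default encoding for the mb_* string functions
mb_http_output('UTF-8');         // encoding mbstring assumes for HTTP output
header('Content-Type: text/html; charset=UTF-8');  // tell the browser what is being sent
```

On recent PHP versions UTF-8 is already the default for most of these, so setting them explicitly mainly guards against a differently configured server.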
If you are using MySQL, make sure your tables are UTF-8; you can convert them with an ALTER TABLE. For safety, you should also set your db connection to UTF-8 by running SET NAMES UTF8 as the first query on every connection.
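Once you have that, you can store and display text in almost any language. One caveat: MySQL's utf8 character set holds at most three bytes per character, so rare CJK characters and emoji outside the Basic Multilingual Plane need utf8mb4. THEN you can remove all your text from your HTML and put it in some sort of lookup system. A common solution is the gettext support in PHP, although you may want to come up with your own solution that can search and replace in bulk.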
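As a rough illustration of a home-grown lookup system, here is a sketch built on plain arrays. The catalog layout, the function name t() and the message keys are all made up for this example; gettext works differently and stores its translations in .po/.mo files.

```php
<?php
// Hypothetical message catalog: one array of strings per language code.
$catalog = [
    'en' => ['greeting' => 'Hello', 'save' => 'Save'],
    'th' => ['greeting' => 'สวัสดี'],
];

// Look up a message key; fall back to English, then to the key itself.
function t(array $catalog, string $lang, string $key): string
{
    return $catalog[$lang][$key] ?? $catalog['en'][$key] ?? $key;
}
```

In a template you would then write something like t($catalog, 'th', 'greeting') instead of the hard-coded English text. The fallback means a half-translated language still shows readable text.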
While it's not trivial to do all this, once it's done it's easy to support more languages. You just need to translate your text. Bing has a free/cheap translation service with good terms. Google used to, but changed its policies about a year ago. Google still has a translation service, but the terms did not work for my needs.
Your question is a big topic, but this information should point you in the right direction.
I'm working on an application using the CakePHP framework, and in the past I ran into a few encoding problems.
To avoid these issues in my application, I started doing some research. But I'm still a little confused about the how and why.
My application will need to support all languages, yes, even languages such as Chinese. Most of the data will be stored in a MySQL database, and that's where the confusion starts. What should I use as the collation?
Based on what I've read the past few days, I have come to the conclusion that the best choice for collation would be utf8_unicode_ci. Is this correct?
Now onto the PHP, what would I set as encoding? UTF-8? I need to completely be sure not a single character shows up the way it shouldn't. Content will be submitted through forms, so the output has to be the same as the input.
I hope anyone can give me an answer to my questions and help clarify it to me, thanks in advance.
You need UTF-8 encoding to store your data. Collation, on the other hand, is used to sort and compare strings. Unfortunately, there is no universal collation, and there cannot be one, because the sorting rules of different languages contradict each other.
For example, in Czech 'ch' sorts after 'h', unlike in most other Latin-script languages.
Yes, utf8_unicode_ci is a sane choice when you don't know in advance the language. As for PHP I'll just link to some answers I wrote in the past:
How to best configure PHP to handle a UTF-8 website
Croatian diacritic signs in MySQL db (utf-8)
Am I correctly supporting UTF-8 in my PHP apps?
One additional piece of advice: make sure your text editor saves all files as UTF-8 (NO BOM, if you have this option). In short, keep everything utf-8 from the very beginning and you should be safe.
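If a file or an uploaded text has already been saved with a BOM, the three marker bytes can be stripped before use. A small sketch, where the function name stripUtf8Bom is made up for this example:

```php
<?php
// Remove a leading UTF-8 byte order mark (bytes EF BB BF) if one is present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}
```

A BOM in a PHP source file is worse than one in data, because it is sent as output before your code runs and can break header() calls. That is why the editor setting matters.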
I want to develop a small web app which would ideally be used worldwide. For the sake of the discussion, let's say it's a recipe sharing site - it's a good enough metaphor.
My app will allow users to enter or upload text in their native languages. My html header says that the site uses utf-8 encoding. I am now creating my MySQL db, and I suppose that I should select utf8_unicode_ci for the char set & collation.
Is that correct?
Is that all I need to do to be able to receive, store, and display safe user-generated-content in their chosen language? If not, what am I missing?
(I am aware of the safety concerns associated with displaying UGC, this is not what the question is about - here I am solely looking for advice to deal with safe content.)
That is all for your HTML and DB part, but you must also ensure that your programming language is UTF-8 aware so it doesn't garble your text. If you use PHP, just make sure that the functions you use are UTF-8 aware. If a function isn't, the manual usually mentions it.
As far as the HTML and the DB go, I think this is all you need.
The only other place where you may need to declare that your input is UTF-8 encoded is where you send and receive your data (for example, a form submitted with a POST request).
You can check post #:1281123 in this forum; it helped a lot when I had some problems with encoding in a similar situation.
I am creating a web-based application using PHP and MySQL. I want it to be able to save any kind of user input, both English and non-English characters such as Arabic or Japanese, at the same time.
What should I do to achieve that?
You need to use Unicode. Read the MySQL manual section on Unicode and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You'll likely want to set the character set (encoding) of the table/columns in question to utf8. You'll also need to set the encoding of your HTML/PHP files to UTF-8. You can do this with a meta tag in <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
You can also set the Content-Type: header that Apache/PHP sends out.
Even after setting this, you may still run into browser-specific issues. For example, Internet Explorer may not always use UTF-8, so Rails 3 had to put in a workaround.
For MySQL, you first need to define your data with the UTF8 character set:
CREATE DATABASE xx [...] DEFAULT CHARACTER SET 'utf8' DEFAULT COLLATE utf8_general_ci
And when creating database connections from PHP, you just need to run a quick command after opening it:
SET NAMES 'utf8'
Alternatively, if you have access to MySQL's my.ini, you can just add this to the config and forget the above:
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8
(note that's not php.ini, but MySQL's ini)
For PHP, if you need to manipulate multibyte strings: make sure you have the mbstring library active, and then change your string & regexp function calls to use the mb_* equivalent.
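To show what changes when you switch to the mb_* functions, here is a short sketch comparing the byte-oriented and character-oriented calls. It assumes mbstring is loaded; the variable names are just for illustration.

```php
<?php
$word = "\u{00E9}t\u{00E9}";  // "été": 3 characters, 5 bytes in UTF-8

$byteCount = strlen($word);                    // counts bytes
$charCount = mb_strlen($word, 'UTF-8');        // counts characters
$firstByte = substr($word, 0, 1);              // cuts the first character in half
$firstChar = mb_substr($word, 0, 1, 'UTF-8');  // returns the whole first character
$upper     = mb_strtoupper($word, 'UTF-8');    // case mapping that knows about accents
```

The half character in $firstByte is not valid UTF-8 on its own, and that is typically how stray question marks end up in truncated text.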
Also, make sure your editor is saving in UTF8 so everything's consistent. Eclipse/PDT makes it easy, at least (project -> properties -> text file encoding).
Finally, handling cultural differences: that's the hard part. Sometimes it's as easy as setting p { direction: rtl; } in CSS, and other times you'll be tearing your hair out trying to decipher what alphabet(s) a user just posted with. It depends on what you're doing with the different languages.
For starters, make sure that you read up on SQL injection. You need to take strong precautions so that the input is safely encoded. Usually you'd be filtering or discarding unsafe content, so if you really need to allow it, be careful that you don't make it easy to hack yourself.
Essentially, you need the same sort of protection that sites like this one use while still allowing "dangerous" content such as source code examples. The same goes for commonly targeted systems such as PHPBB2, WordPress, wikis, etc.
I think your task is harder if the data needs to be searchable.
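If you are using PHP's old mysql extension, the mysql_real_escape_string() function looks good (note that this extension was removed in PHP 7; mysqli and PDO offer prepared statements instead):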
http://www.tizag.com/mysqlTutorial/mysql-php-sql-injection.php
Otherwise, use something similar.
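A different technique from escaping, and the usual "something similar" today, is a prepared statement with bound parameters. The sketch below uses PDO; the users table, its columns and the function name findUsersByName are assumptions made up for this example.

```php
<?php
// Hypothetical helper: look up users by name with a bound parameter.
function findUsersByName(PDO $db, string $name): array
{
    // The SQL text contains only a placeholder, never the user's input.
    $stmt = $db->prepare('SELECT id, name FROM users WHERE name = :name');
    $stmt->execute([':name' => $name]);   // the value travels separately from the SQL
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Because the value is never pasted into the query string, quotes and non-English characters in $name need no special treatment.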
I have content stored in a Postgres DB. Every time I fetch the content to display it using PHP, I get funny squares in IE and funny square-type question marks in Firefox.
Example below
* - March � May 2009
How do I remove this?
I do not have access to the server so can't adjust the encoding there, only have postgres DB details and FTP access to upload my files
I would also recommend: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, I've read it only recently myself, it will definitely help you sort out your problems.
You need to make sure that Postgres, PHP, and your browser all agree on the content encoding, and that you have an appropriate font selected in your browser. The simplest way to do that is to choose UTF8 for everything.
I don't know about PHP, but I do know about databases and browsers. First you need to find out if the database is UTF8. (From psql, I would do a "\l" and look at the encoding.) Then you need to find out if PHP supports UTF8 (I have no idea how you do that). Then you need to see how those characters are being stored in the database by the PHP app. Then you need to figure out if the web server is correctly reporting the content encoding. (On Linux/Unix, I'd use the program "HEAD" (not "head") to see the headers it's returning.) And then you need to figure out if your browser is using a font that has glyphs for those characters.
Or, you could just make sure you only store ASCII and forget the rest of the world exists. Not recommended.
Wrong charset somewhere. The characters could already be stored wrongly in the database, or you have the wrong charset in the meta tags on the page (try manually changing the charset in the browser), or the wrong encoding could be used when the page communicates with the database.
Check this page http://www.postgresql.org/docs/8.2/static/multibyte.html for more information.
Try to have same encoding on all places, preferably UTF-8
You have encoding issues. Make sure the encoding is set right in the database, in the html markup and make sure the files themselves are saved in proper encoding.
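One way to check which case you are in is to look at the raw bytes PHP receives. The sketch below assumes, purely for illustration, that the stored text is Windows-1252, where the single byte 0x96 is an en dash. That is a guess about the data in the question, not something the question confirms.

```php
<?php
// Assumption: the database hands back Windows-1252 bytes; 0x96 is an en dash there.
$raw = "March \x96 May 2009";

$hex = bin2hex($raw);  // hex dump of the bytes actually received, useful for debugging

if (!mb_check_encoding($raw, 'UTF-8')) {
    // Not valid UTF-8, so convert from the assumed source encoding.
    $fixed = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
} else {
    $fixed = $raw;
}
```

If the hex dump shows a lone byte such as 96 where the odd character sits, the data is in a single-byte encoding and the page is mislabelling it as UTF-8.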
It often happens that characters such as é get transformed to Ã©, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF8.
I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.
I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.
How do I ensure that user entered data containing international characters doesn't get corrupted?
On the first look at http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet I think that one important thing is missing (perhaps I overlooked this one).
Depending on your MySQL installation and/or configuration, you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a
SET NAMES utf8
query prior to any other query you send to the MySQL server.
If you're using PDO on the PHP side, you can set up the connection to automatically issue this query on every (re)connect by using
$db = new PDO($dsn, $user, $pass, array(
    PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"  // must be passed to the constructor; setAttribute() afterwards has no effect
));
when initializing your db connection.
Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, the collation should be a utf-8 one too. Picking the wrong collation won't garble your data, though; it just makes string comparison and sorting work wrongly.
That said, there are several places where you can set charset settings in PHP. I would recommend that you use utf-8 throughout, if possible. Places that need a charset specified are:
The database. This can be set on database, table and field level, and even on a per-query level.
Connection between PHP and database.
HTTP output; Make sure that the HTTP-header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
HTTP input. Generally, forms will be submitted in the same charset as the page was served in, but to make sure, you should specify the accept-charset attribute on the form. Also make sure that URLs are utf-8 encoded, or avoid using non-ASCII characters in URLs (and GET parameters).
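For the HTTP input part, a browser can still send bytes that are not valid utf-8, so it can be worth checking before the value goes anywhere near the database. A minimal sketch, assuming mbstring is loaded; the helper name utf8OrNull is made up for this example:

```php
<?php
// Hypothetical helper: return the value only if it is well-formed UTF-8.
function utf8OrNull(?string $value): ?string
{
    if ($value === null || !mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    return $value;
}
```

You could call it as utf8OrNull($_POST['name'] ?? null) and treat null as a validation error. Whether to reject bad input or try to convert it is a design choice this sketch leaves open.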
utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
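The sketch below reproduces the typical corruption those functions cause when applied to text that is already utf-8. It uses mb_convert_encoding, which performs the same latin1 to utf-8 conversion when given these arguments, because utf8_encode/decode are deprecated as of PHP 8.2.

```php
<?php
$original = "\u{00E9}";   // "é" as UTF-8: bytes C3 A9

// Mistake: treat the UTF-8 bytes as ISO-8859-1 and encode them a second time.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');   // displays as "Ã©"

// Repair: reverse the extra conversion step.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
```

Seeing two characters where one accented letter should be is a strong hint that a conversion like this was applied once too often somewhere between the form and the database.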
There are at least two gotchas in regard to utf-8 and PHP. The first is that PHP's built-in string functions expect strings to be single-byte. For a lot of operations this doesn't matter, but it means that you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually it's not a big problem, but especially when using third-party libraries, you need to be aware that things could blow up on this. One option is to use the mbstring extension, which has the option to replace all troublesome functions with utf-8 aware alternatives. It's still not a 100% bulletproof solution, but it'll work for most cases.
Another problem is that some installations of PHP still have the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.
Things you should do:
Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header()-function to do it manually.
Make sure your database connection is UTF8. SET NAMES utf8 does the trick.
Make sure all your tables are set to UTF8.
Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.
You usually don't have to do too much with the mbstring or utf8_encode/decode functions when you do this.
For better unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following Mysql flags are set correctly -
default-character-set=utf8
skip-character-set-client-handshake # Important so the client doesn't enforce another encoding
Those can be set in the MySQL configuration file (under the [mysqld] section) or at run time by sending the appropriate queries.
Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:
Identify the encoding
Somehow you need to find out what kind of encoding you're dealing with; otherwise it's pretty pointless to go any further. You'll end up with junk chars.
Handle your bytes
Think of these strings less as 'strings' of characters and more as lists of bytes.
PHP is especially sneaky. Don't let it truncate your data on the fly. If you're running a regex on a UTF-8 string, make sure you identify it as such.
Store for the LCD (lowest common denominator)
Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphs? How about Arabic? Which of these is going to require the most space? Account for it.
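The last two points can be made concrete with a short PHP sketch: how many bytes one character takes in UTF-8 for a few scripts, and how a regex behaves with and without being told the subject is UTF-8. The sample characters are arbitrary picks for illustration.

```php
<?php
// Bytes needed per character in UTF-8 for a few scripts.
$samples = [
    'latin'   => "a",          // U+0061
    'arabic'  => "\u{0628}",   // Arabic letter beh
    'chinese' => "\u{4E2D}",   // a common Chinese character
    'emoji'   => "\u{1F600}",  // outside the Basic Multilingual Plane
];
$bytes = array_map('strlen', $samples);  // strlen() counts bytes, not characters

// The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
$byteMode = preg_match('/^.$/', "\u{4E2D}");   // "." matches one byte, so no match
$utf8Mode = preg_match('/^.$/u', "\u{4E2D}");  // "." matches one whole character
```

Sizing columns and buffers in characters rather than bytes, or allowing four bytes per character, avoids silent truncation. In MySQL the four-byte case needs the utf8mb4 character set.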