Multilingual Application Encoding - php

I'm working on an application using the CakePHP framework and in the past I ran into a few encounters with encoding.
To avoid these issues in my application, I started doing some research. But I'm still a little confused about the how and why.
My application will need to support all languages, yes even languages such as Chineese. Most of the data will be stored into a MySQL database, and that's where confusion starts. What should I use as collation?
Based on what I've read the past few days, I come to the conclusion the best choice for collation would be utf8_unicode_ci. Is this correct?
Now onto the PHP, what would I set as encoding? UTF-8? I need to completely be sure not a single character shows up the way it shouldn't. Content will be submitted through forms, so the output has to be the same as the input.
I hope anyone can give me an answer to my questions and help clarify it to me, thanks in advance.

You need UTF-8 encoding to store you data. But as for collation, it is used to sort strings. Unfortunatelly, there exists no universal collation, and such universal collation can not exists, because collations are contradictory.
To make a point on example, in Czech 'ch' goes after 'h', opposite to most other Latin script languages.

Yes, utf8_unicode_ci is a sane choice when you don't know in advance the language. As for PHP I'll just link to some answers I wrote in the past:
How to best configure PHP to handle a UTF-8 website
Croatian diacritic signs in MySQL db (utf-8)
Am I correctly supporting UTF-8 in my PHP apps?
One additional advice would be to make sure your text editor saves all files as UTF-8 (NO BOM, if you have this option). In short, keep everything utf-8 from the very beginning and you should be safe.

Related

How to deal with legacy app storing Arabian characters in latin1 database

I now work on a web-base PHP app to work with a MySQL Server database .
database is with Latin1 Character set and all Persian texts don't show properly .
database is used with a windows software Show and Save Persian texts good .
I don't want to change the charset because windows software work with that charset .
Question:
how can convert latin1 to utf8 to show and utf8 to latin1 for saving from my web-base PHP app , or using Persian/Arabic language on a latin1 charset database without problem ?
note:
one of my texts is احمد رحمانی when save from my windows-based software save as ÇÍãÏ ÑÍãÇäí and still show with احمد رحمانی in my old windows-based software
image : image of database , charsets,collation and windows-based software (full Size)
Edit: Your screenshot shows that the diagnosis below is probably correct.
What to do: Try using iconv() in your PHP web application. You would have to guess or find out what collation/codepage your Windows app uses.
Then something like this might work:
$string_decoded = iconv("windows-1256", "utf-8", $string);
You may need to experiment to get this working. Also, I think you need to force your database connection to use latin1 instead of UTF-8!
If you ask me, this is not a good basis for your web app. You would have to convert data into a broken format all the time. You may have to break compatibility with your application, or write an import tool.
The latin1 character set does not cover Persian characters. Proof at collation-charts.org
The only explanation I have why your Delphi program is able to store Arabic characters in a latin1 database is, it could be misusing the latin1 database to store data that isn't covered by latin1, e.g. Windows-1256 Arabic. So the program would store the raw bytes of each arabic character, while in fact these bytes are occupied by other, latin characters in the latin1 character set. But as long as was only the Delphi program storing and fetching the data, no one noticed.
If I'm correct in that - it's the only way I can see how what you describe could be happening - that works as long as only applications are involved that do this the same way, a way which is wrong really.
You should be able to confirm whether this is the case by looking at the data from a "neutral" database tool like phpMyAdmin or HeidiSQL. If you see garbage there instead of Arabic / Persian characters, I may be right.
As to what to do to make your PHP web app work with the same database as your Delphi app - I'm not really sure what to do to be honest. As far as I know, there is no way to force mySQL to use one encoding instead of the other. You would have to manually "re-encode" the data before fetching it into your web app. This is likely to be a painful process.
But first, try to find out what exactly is happening.

utf-8 encoding in HTML and utf8_unicode_ci char set & collation for MySQL - can I store & display any type of text in it now?

I want to dvp a small web app which would ideally be used worldwide. For the sake of the discussion, let's say it's a recipe sharing site - it's a good enough metaphor.
My app will allow users to enter or upload text in their native languages. My html header says that the site uses utf-8 encoding. I am now creating my MySQL db, and I suppose that I should select utf8_unicode_ci for the char set & collation.
Is that correct?
Is that all I need to do to be able to receive, store, and display safe user-generated-content in their chosen language? If not, what am I missing?
(I am aware of the safety concerns associated with displaying UGC, this is not what the question is about - here I am solely looking for advice to deal with safe content.)
It is all for you HTML and DB part, but you must ensure that the programming language is UTF-8 aware so it doesn't garble your stuff. If you use PHP just make sure that the functions you use are UTF-8 aware. If it isn't the manual usually mentions it.
As far as the html and the db i think this is all you need.
The only other part you may need to define that your inputs are UTF-8 encoded, is the part where you send/receive your data (assuming with a form and a post request for example).
You can check post #:1281123 in this forum, it helped a lot when i had some problems with encoding in a similar situation.

Help with multi-lingual text, php, and mysql

I have had no end of problems trying to do what I thought would be relatively simple:
I need to have a form which can accept user input text in a mix of English an other languages, some multi-byte (ie Japanese, Korean, etc), and this gets processed by php and is stored (safely, avoiding SQL injection) in a mysql database. It also needs to be accessed from the database, processed, and used on-screen.
I have it set up fine for Latin chars but when I add a mix of Latin andmulti-byte chars it turns garbled.
I have tried to do my homework but just am banging my head against a wall now.
Magic quotes is off, I have tried using utf8_encode/decode, htmlentities, addslashes/stripslashes, and (in mysql) both "utf8_general_ci" and "utf8_unicode_ci" for the field in the table.
Part of the problem is that there are so many places where I could be messing it up that I'm not sure where to begin solving the problem.
Thanks very much for any and all help with this. Ideally, if someone has working php code examples and/or knows the right mysql table format, that would be fantastic.
Here is a laundry list of things to check are in UTF8 mode:
MySQL table encoding. You seem to have already done this.
MySQL connection encoding. Do SHOW STATUS LIKE 'char%' and you will see what MySQL is using. You need character_set_client, character_set_connection and character_set_results set to utf8 which can easily set in your application by doing SET NAMES 'utf8' at the start of all connections. This is the one most people forget to check, IME.
If you use them, your CLI and terminal settings. In bash, this means LANG=(something).UTF-8.
Your source code (this is not usually a problem unless you have UTF8 constant text).
The page encoding. You seem to have this one right, too, but your browsers debug tools can help a lot.
Once you get all this right, all you will need in your app is mysql_real_escape_string().
Oh and it is (sadly) possible to successfully store correctly encoded UTf8 text in a column with the wrong encoding type or from a connection with the wrong encoding type. And it can come back "correctly", too. Until you fix all the bits that aren't UTF8, at which point it breaks.
I don't think you have any practical alternatives to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the data base from the mysql command line, or perhaps through phpmyadmin. Track down and eliminate problems at that level. Then move out one more level by simulating input to your php and examining the output, again dealing with any problems. Finally add browsers into the mix.
First you need to check if you can add multi-language text to your database directly. If its possible you can do it in your application
Are you serializing any data by chance? PHPs serialize function has some issue when serializing non-english characters.
Everything you do should be utf-8 encoded.
One thing you could try is to json_encode() the data when putting it into the database and json_decoding() it when it's retrieved.
The problem was caused by my not having the default char set in the php.ini file, and (possibly) not having set the char set in the mysql table (in PhpMyAdmin, via the Operations tab).
Setting the default char set to "utf-8" fixed it. Thanks for the help!!
Check your database connection settings. It also needs to support UTF-8.

php + mysql + html + javascript = i18n headache

This has been always a problem for me , Character problem . I always tried to solve my problem with little patches , actually this never solves my problem in reality.So I am looking for very strong solution to solve all these problems.I want to learn how big apps(facebook , google, other multi lingual ajax apps and apis) solve this problem. I want a solution which will solve all my character encoding , etc problems.I use php, mysql, html and javascript to create my application , so the solution should solve all problems or all these languages together.If you write full configuration this is perfect , but if there is a long long document , I can read it to . I need help . Thank you . I can not transfer string(text) correctly through all these languages
Also I pull data from external apis.How should I take care of them
It's pretty easy if you just stick to using Unicode everywhere.
set MySQL table encodings to UTF-8
make sure you're talking to the database in UTF-8 by running SET NAMES utf8
save all your source code in UTF-8
when manipulating strings in PHP which may contain UTF-8 characters, use the mb_ functions
send HTTP Content-Type headers denoting that the content is in UTF-8
Javascript is intrinsically UTF-8, so you should have no worries there
The thing is that different technologies default to different character encodings. Unfortunately strings do not have implicit encoding metadata attached, they're just sequences of bytes. Unless being told, the receiver of a string can only make a best guess what encoding that sequence is supposed to be in. Whenever connecting two pieces of anything, you need to make sure they're using the same encoding (or you need to specifically convert from one encoding to the other). Always assume that you have to define the encoding somewhere, how exactly that needs to be done depends on the technology.

Getting funny squares in browser when displaying content

I have content stored in a Postgres DB, now everytime I call the content so that it gets displayed using php, i get funny squares in IE and funny square type question marks in Firefox?
Example below
* - March � May 2009
How do I remove this?
I do not have access to the server so can't adjust the encoding there, only have postgres DB details and FTP access to upload my files
I would also recommend: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, I've read it only recently myself, it will definitely help you sort out your problems.
You need to make sure that Postgres, PHP, and your browser all agree on the content encoding, and that you have an appropriate font selected in your browser. The simplest way to do that is to choose UTF8 for everything.
I don't know about PHP, but I do know about databases and browsers. First you need to find out if the database is UTF8. (From psql, I would do a "\l" and look at the encoding.) Then you need to find out if PHP supports UTF8 (I have no idea how you do that). Then you need to see if how those characters are being stored in the database by the PHP app. Then you need to figure out if the web server is correctly reporting the content encoding. (On Linux/Unix, I'd use the program "HEAD" (not "head") to see the headers its returning.) And then you need to figure out if your browser is using a font that supports UTF8.
Or, you could just make sure you only store ASCII and forget the rest of the world exists. Not recommended.
Wrong charset somewhere. The characters could be stored wrong already in database, or you have wrong charset in meta tags on the page(try manually change charset in browser), or there could be problem with wrong encoding when page is communicating with database.
Check this page http://www.postgresql.org/docs/8.2/static/multibyte.html for more informations.
Try to have same encoding on all places, preferably UTF-8
You have encoding issues. Make sure the encoding is set right in the database, in the html markup and make sure the files themselves are saved in proper encoding.

Categories