How to check file encoding in Linux? Handling multilingual scripts - php

My company has PHP scripts with text in different languages (including French, German, Spanish, Italian and English).
Developers decided to use Latin-1 encoding as the base for everyone, so that nobody would override a file's encoding and corrupt the foreign-language text in it. (At first some developers used HTML entities, but that approach is not preferred.)
I have a few questions for you:
How can you check a file's encoding on Linux?
If you have experience working with files in different languages, how did you manage not to override the encoding used by others?
Thanks in advance for any advice.

file gives you information about a file, including its charset, language, etc., depending on the file type.
Use --mime-encoding to get only the information you want.
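For example, running file --mime-encoding myscript.php prints just the charset, e.g. us-ascii, iso-8859-1 or utf-8 (the file name is just an example). If you need the same check from inside PHP, the Fileinfo extension exposes it; a minimal sketch:

<?php
// Guess the character encoding of a file via the Fileinfo extension.
// Like the file utility, this is only a heuristic guess.
$finfo = finfo_open(FILEINFO_MIME_ENCODING);
echo finfo_file($finfo, 'myscript.php');   // e.g. "utf-8" or "iso-8859-1"
finfo_close($finfo);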

Developers decided to use Latin-1 encoding as the base for everyone, so that nobody would override a file's encoding and corrupt the foreign-language text in it.
Latin-1 can't represent most of the world's languages. Flavours of Unicode (typically UTF-8) are preferred.
How can you check a file's encoding on Linux?
With the file utility. It can only guess though.
If you have experience working with files in different languages, how did you manage not to override the encoding used by others?
Sensibly configured editors.

1. I have used iconv for converting back and forth, but since you don't know the encoding, try enca (Extremely Naive Charset Analyser) first. In general, though, it is very hard to get detection right, since it requires knowledge of common words and so on.
2. The only sane approach is to use a larger character set such as Unicode for this. You could enforce this by adding a pre-checkin hook to your source control system which only allows correctly formatted UTF-8 files (for instance).
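A rough sketch of such a hook in PHP, assuming the file names to check arrive on the command line (how they actually reach the script depends on your VCS):

<?php
// Pre-commit check: reject any file that is not valid UTF-8.
$bad = array();
foreach (array_slice($argv, 1) as $file) {
    if (!mb_check_encoding(file_get_contents($file), 'UTF-8')) {
        $bad[] = $file;
    }
}
if ($bad) {
    fwrite(STDERR, "Not valid UTF-8: " . implode(', ', $bad) . "\n");
    exit(1);   // a non-zero exit status makes the VCS abort the commit
}
exit(0);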

There is no reliable way of checking the encoding of a file; the various 8-bit single-byte encodings are virtually indistinguishable without inspection. Using UTF-8 everywhere means that everyone has a single, universally-valid encoding to work with.

Related

PHP utf-8 best practices and risks for distributed web applications

I have read several things about this topic but still I have doubts I want to share with the community.
I want to add complete UTF-8 support to the application I developed, DaDaBIK; the application can be used with different DBMSs (such as MySQL, PostgreSQL, SQLite). The charset used in the databases can be anything; I can't set or assume the charset.
My approach would be to convert, using the iconv functions, everything I read from the DB into UTF-8 and then convert it back into the original charset when I have to write to the DB. This would allow me to assume I'm working with UTF-8.
The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even assuming you use mbstring, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can create problems with UTF-8 and DON'T have an mbstring counterpart, for example the PREG extension, strcspn, trim, ucfirst, ucwords...
Since I'm using some external libraries such as adodb and htmLawed, I can't control all the source code... in those libraries there are several uses of those functions... do you have any advice about that? And above all, how do very popular applications like WordPress handle this (IMHO big) problem? I doubt they don't have any "trim" in their code... do they just take the risk (data corruption, for example), or is there something I can't see?
Thanks a lot.
First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.
It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
preg_ supports UTF-8 just fine using the /u modifier.
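A small illustration of both points (the sample word is arbitrary):

<?php
// "Älter" is 5 characters but 6 bytes in UTF-8, because "Ä" is encoded as 2 bytes.
$word = "Älter";

echo substr($word, 0, 1);                    // mojibake: only the first byte of "Ä"
echo mb_substr($word, 0, 1, 'UTF-8');        // "Ä": counts characters, not bytes

var_dump(preg_match('/^\p{L}+$/u', $word));  // int(1): with /u, \p{L} matches any Unicode letter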
If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for some more in-depth discussion and demystification about PHP and character sets.
Further, it does not matter what the strings are encoded as in the database. You can set the connection encoding for the database, which will cause it to convert everything for you and always return you data in the desired client encoding. No need for iconverting everything in PHP.
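With MySQL, for instance, that could look like this (credentials and database name are placeholders):

<?php
// Tell the driver to hand UTF-8 to PHP, whatever the tables are stored as.
$mysqli = new mysqli('localhost', 'user', 'secret', 'mydb');
$mysqli->set_charset('utf8mb4');

// Or with PDO, via the DSN:
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'secret');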

What should I know to make my I18N application work in Japanese?

I'm working on an I18N application which will be localized in Japanese. I don't know a word of Japanese, and I'm first wondering if UTF-8 is enough for that language.
Usually, for European languages, UTF-8 is enough: I set up my database charset/collation to use utf8_general_ci (in MySQL) and my HTML views in UTF-8, and that's it.
But what about Japanese, is there something else to do?
By the way, my application should be able to handle English, French and Japanese, but later on it may be necessary to add other languages, say Russian.
How could I set up my I18N application to be available widely without having to change much configurations on deployment?
Is there any best practices?
By the way, I'm planning to use gettext; I'm pretty sure it supports such languages without any problems, as it is the de facto standard for almost all GNU software, but any feedback?
A couple of points:
UTF-8 is fine for your app-internal data, but if you need to process user-supplied documents (e.g. uploads), those may use other encodings like Shift-JIS or ISO-2022-JP
Japanese text does not use whitespace between words. If your app needs to split text into words somewhere, you've got a problem.
Apart from text, date and number formats differ
The generic collation may not lead to a useful sort order for Japanese text - if your app involves large lists that people have to find things in, this can be a problem.
Yep, Unicode contains all the code points you need to display English, French, Japanese, Russian, and pretty much any language in the world (including Taiwanese, Cherokee, Esperanto, really anything but Elfish). That's what it's for. Due to the nature of UTF-8, though, text in more esoteric languages will take a few more bytes to store.
Gettext is widely used and your PHP build probably even includes it. See http://php.net/gettext for usage details.
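A minimal usage sketch, assuming a text domain called myapp and .mo files under ./locale (e.g. locale/ja_JP.utf8/LC_MESSAGES/myapp.mo); the locale has to be installed on the server:

<?php
$locale = 'ja_JP.utf8';
putenv('LC_ALL=' . $locale);
setlocale(LC_ALL, $locale);

bindtextdomain('myapp', __DIR__ . '/locale');
bind_textdomain_codeset('myapp', 'UTF-8');
textdomain('myapp');

echo _('Welcome');   // prints the Japanese translation if the .mo file provides one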
Just to add that interesting website to help build I18N application: http://www.i18nguy.com/
If you store text in text files then it goes like this:
This is the main folder structure for language:
lang/
  en/
  fr/
  jp/
  ...
Every subfolder (en, fr, ...) contains the same files and the same variables, just with different values.
For example, in lang/en/links.txt you would have:
<?php
class txtLinks
{
    public static $menu = "Menu";
    public static $products = "Show products";
    // ...
}

class txtErrors
{
    public static $wrongUName = "This user does not exist";
    // ...
}
Then, when a script loads, you do something like:
if ($lang === 'en') {          // $lang = however you detect the visitor's language
    define('__LANG', 'en');
} elseif ($lang === 'fr') {
    define('__LANG', 'fr');
}
// ...
Then
include('lang/' . __LANG . '/links.txt'); // or whatever file you need
Then this is a piece from your PHP script:
echo txtLinks::$menu; // etc.
If you go the database way, you do something analogous: instead of files you have tables.
This way you have absolute freedom, because you can hand the English files to someone who speaks, say, French, and they can fill in the French values without needing to know any programming at all.
And you yourself don't care what language is later added or removed.
And if you use MVC, you can split the language files per controller so you don't end up loading one huge text file.

___ encoding to UTF-8 - is there an end-all solution?

I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get an unknown character set, and it has strange characters (like English quotes), is there a standard way to convert them to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason you saw so many complicated solutions for this problem is that, by definition, it is not solvable: recovering the original text and its encoding from a byte stream is ambiguous.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) that are sent to the server. This is a valid EBCDIC string that represents the letters C and d. This is also a valid ANSI string that represents the 2 characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because character encoding is non-deterministic, you cannot expect that there exist a trivial method to uncover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
function forceToUtf8($string) {
    // Give up if the bytes are not valid in the detected encoding.
    if (!mb_check_encoding($string)) {
        return false;
    }
    // mb_detect_encoding() is only a guess, so the conversion is best-effort.
    return mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));
}
If I'm not wrong, there is something called utf8_encode... it works well EXCEPT if the input is already UTF-8 (it assumes the input is ISO-8859-1).
http://php.net/manual/en/function.utf8-encode.php

Change Website Character encoding from iso-8859-1 to UTF-8

About 2 years ago I made the mistake of starting a large website using iso-8859-1. I now am having issues with some characters, especially when sending data to the server using ajax. Because of this, I would like to switch to using UTF-8.
What issues do you see coming from this? I know I would have to search the site to look for characters that need to be changed from a ? to their real characters. But, are there any other risks in doing this? Has anyone done this before?
The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:
Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.
Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively — UTF-16 is next most common — and check that it's configured to use UTF-8 on its output to the web server.
Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?
There are at least three different places you can declare the charset for a web document. Be sure you change them all:
the HTTP Content-Type header
the <meta http-equiv="Content-Type"> tag in your documents' <head>
the <?xml> tag at the top of the document, if using XHTML Strict
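As a sketch, all three declarations on one XHTML page could look like this (the XML declaration is echoed from PHP so that short_open_tag does not mistake it for a PHP open tag):

<?php
// 1. The HTTP header; it must be sent before any other output.
header('Content-Type: text/html; charset=UTF-8');
// 3. The XML declaration, only needed when serving XHTML.
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <!-- 2. The meta tag in the document head. -->
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Example</title>
</head>
<body>...</body>
</html>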
All this comes from my experience a few years ago when I traced some Unicode data through a moderately complex N-tier app, and found conversion chains like:
Latin-1 → UTF-8 → Latin-1 → UTF-8
So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset common with Latin-1.
The biggest reason for those odd conversion chains was due to immature Unicode support in the tooling at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
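If you'd rather do it from PHP than from the shell, the same idea can be sketched with PHP's iconv() function (the root path and the file-extension filter are assumptions; back up first and run it only once, since re-converting files that are already UTF-8 would double-encode them):

<?php
// Convert every .php and .html file under $root from Latin-1 to UTF-8, in place.
$root = '/var/www/mysite';

$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));
foreach ($files as $file) {
    if (!$file->isFile() || !preg_match('/\.(php|html)$/', $file->getFilename())) {
        continue;
    }
    $latin1 = file_get_contents($file->getPathname());
    file_put_contents($file->getPathname(), iconv('ISO-8859-1', 'UTF-8', $latin1));
}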
Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.
Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).
IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:
PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
The most common problem is treating a UTF-8 string as ISO-8859-1 and vice versa - you may need to add documentation to your functions stating what kind of encoding they expect, to make these sorts of errors less likely. A variable naming convention for your strings (to indicate what encoding they use) may also help.

Is it possible to write a great PHP app which uses Unicode?

My next web application project will make extensive use of Unicode. I usually use PHP and CodeIgniter; however, Unicode is not one of PHP's strong points.
Is there a PHP tool out there that can help me get Unicode working well in PHP?
Or should I take the opportunity to look into alternatives such as Python?
PHP can handle Unicode fine once you make sure to encode and decode on entry and exit. If you are storing in a database, ensure that the language encodings and charset mappings match up between the HTML pages, web server, your editor, and the database.
If the whole application uses UTF-8 everywhere, no conversion is necessary. The only time you need to convert is when you are outputting data in some other charset, for something other than the web. When outputting HTML, you can use
htmlentities($var, ENT_QUOTES, 'UTF-8');
to get the correct output. Called without the charset argument, the function may assume ISO-8859-1 and mangle a UTF-8 string. The same goes for the mail functions.
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet is a very good resource for working in UTF-8
One of the major features of PHP 6 will be tightly integrated Unicode support.
Implementing UTF-8 in PHP 5.
Since PHP strings are byte-oriented, the only practical encoding scheme for Unicode text is UTF-8. The tricks are [got them from PHP Architect Magazine]:
Present HTML pages in UTF-8
Convert PHP scripts to UTF-8
Convert the site content, back-end databases and the like to UTF-8
Ensure that no PHP functions corrupt the UTF-8 text
Check out the PHP UTF-8 Cheat Sheet at http://www.gravitonic.com/talks/
PHP is mostly unaware of charsets and treats strings as byte streams. That's not much of a problem really, but you'll have to do a bit of work yourself.
The general rule of thumb is that you should use the same charset everywhere. If you use UTF-8 everywhere, then you're 99% there. Just make sure that you don't mix charsets, because then it gets really complicated. The only thing that won't work correctly with UTF-8 is string manipulation that needs to operate at the character level, e.g. strlen, substr, etc. You should use UTF-8-aware versions in their place; the multibyte string extension (mbstring) gives you just that.
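For example (the sample string is arbitrary):

<?php
// Byte-oriented vs. character-aware functions on a UTF-8 string.
mb_internal_encoding('UTF-8');

$s = "héllo";                   // 5 characters, 6 bytes ("é" takes 2 bytes in UTF-8)

echo strlen($s);                // 6: counts bytes
echo mb_strlen($s);             // 5: counts characters

echo strtoupper($s);            // "HéLLO": only the ASCII letters are changed
echo mb_strtoupper($s);         // "HÉLLO"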
For a checklist of places where you need to make sure the charset is set correct, look at:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
For more information, look at:
http://www.phpwact.org/php/i18n/utf-8
