If a website is localized/internationalized with a Simplified Chinese translation...
Is it possible to reliably automatically convert the text to Traditional Chinese in a high quality way?
If so, is it going to be extremely high quality or just a good starting point for a translator to tweak?
Are there open source tools (ideally in PHP) to do such a conversion?
Is the conversion better one way vs. the other (simplified -> traditional, or vice versa)?
Short answer: No, not reliably and at high quality. I wouldn't recommend automated tools unless the market isn't that important to you and you can risk certain publicly embarrassing flubs. You may find some localization firms are happier to start with a quality simplified Chinese translation and adapt it to traditional, but you may also find that many companies prefer to start with the English source.
Longer answer: There are some cases where only the glyphs are different, and they have different Unicode code points. But there are also idiomatic and vocabulary differences between the PRC and Taiwan/Hong Kong, and your quality will suffer if these aren't handled. Technical terms may be more or less problematic, depending on the era in which the terms became commonly used. Some of these issues may be caught by automated tools, but not all of them. Certainly, if you go the route of automatically converting things, make sure you get buyoff from QA teams based in each of your target markets.
Additionally, there are sociopolitical concerns. For example, you can use terms like "Republic of China" in Taiwan, but this will royally piss off the Chinese government if it appears in your simplified Chinese version (and sometimes your English version); if you have an actual subsidiary or partner in China, the staff may be arrested solely on the basis of subversive terminology. (This is not unique to China; Pakistan/India and Turkey have similar issues.) You can get into similar trouble by referring to "Taiwan" as a "country."
As a native Hong Konger myself, I concur with @JasonTrue: don't do it. You risk angering and offending your potential users in Taiwan and Hong Kong.
BUT, if you still insist on doing so, have a look at how Wikipedia does it; here is one implementation (note license).
Is it possible to reliably automatically convert the text to Traditional Chinese in a high quality way?
Other answers have focused on the difficulties, but these are exaggerated. One thing is that a substantial portion of the characters are exactly the same. The second is that the 'simplified' forms are exactly that: simplified forms of the traditional characters. That means there is mostly a one-to-one relationship between traditional and simplified characters, as sketched below.
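As a rough sketch of what that mostly one-to-one mapping looks like in PHP (the three-entry table below is purely illustrative; a real table, like the one MediaWiki ships, has thousands of entries, including multi-character ones for the ambiguous cases):

// Illustrative simplified-to-traditional mapping; a real table is huge.
// strtr() with an array matches the longest key at each position first,
// so multi-character entries can override single-character rules.
$s2t = array(
    '汉'   => '漢',   // simple one-to-one case
    '后'   => '後',   // default guess: "behind"
    '皇后' => '皇后', // word-level override: keep "queen" untouched
);

echo strtr('汉字', $s2t), "\n";     // 漢字
echo strtr('皇后大道', $s2t), "\n"; // 皇后大道, not 皇後大道

The word-level entry is what avoids the "Queen's Behind Road" pitfall discussed in another answer below.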
If so, is it going to be extremely high quality or just a good starting point for a translator to tweak?
A few things will need tweaking.
Are there open source tools (ideally in PHP) to do such a conversion?
Not that I am aware of, though you might want to check out the Google Translate API?
Is the conversion better one way vs. the other (simplified -> traditional, or vice versa)?
A few characters lost their distinctions in the simplified script. For instance, 麵 (flour) was simplified to the same character as 面 (face, side). For this reason, traditional -> simplified would be slightly more accurate.
I'd also like to point out that traditional characters are not used solely in Taiwan (they can be found in Hong Kong and occasionally even on the mainland).
I was able to find this and this. Need to create an account to download, though. Never used the site myself so I cannot vouch for it.
Fundamentally, simplified Chinese words have a lot of missing meanings. No programming language in the world will be able to accurately convert simplified Chinese into traditional Chinese. You will just cause confusion for your intended audience (Hong Kong, Macau, Taiwan).
A perfect example of failed translation from simplified Chinese to traditional Chinese is the word "后". In the simplified form, it has two meanings: "behind" or "queen". When you attempt to convert this back to traditional Chinese, however, there is more than one character choice: 後 ("behind") or 后 ("queen"). One funny example I came across is a translator which converted "皇后大道" (Queen's Road) to "皇後大道", which literally means Queen's Behind Road.
Unless your translation algorithm is super smart, it is bound to produce errors. So you're better off hiring a very good translator who's fluent in both types of Chinese.
Short answer: Yes, and it's easy. You can first convert it from UTF-8 to BIG5; then there are lots of tools to convert BIG5 to GBK; then you can convert GBK back to UTF-8.
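For what it's worth, here is a sketch of that chain in PHP. Note that iconv() only re-encodes bytes; the BIG5-to-GBK step in the middle is where the actual character conversion would happen, so big5_to_gbk() below is a hypothetical placeholder standing in for one of the dedicated tools this answer refers to:

// Encoding chain sketch: UTF-8 -> BIG5 -> GBK -> UTF-8.
// iconv() changes encodings only; big5_to_gbk() is HYPOTHETICAL and
// stands in for a real BIG5-to-GBK conversion tool with a mapping table.
$big5 = iconv('UTF-8', 'BIG5//TRANSLIT', $traditionalUtf8);
$gbk  = big5_to_gbk($big5);          // hypothetical conversion step
$utf8 = iconv('GBK', 'UTF-8', $gbk); // back to UTF-8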
I know nothing about any form of Chinese, but looking at the examples on this Wikipedia page, I'm inclined to think that automatic conversion is possible, since many of the phrases seem to use the same number of characters and even some of the same characters.
I ran a quick test using a multibyte ord() function, and I can't see any patterns that would allow automatic conversion without the use of a (huge?) lookup translation table.
Traditional Chinese 漢字
Simplified Chinese 汉字
// Returns the Unicode code point of a single UTF-8 character, or false on
// failure. Guarded because PHP 7.2+ ships a native mb_ord() in ext-mbstring.
if (!function_exists('mb_ord'))
{
    function mb_ord($string)
    {
        // Re-encode the character as big-endian UCS-4, then unpack the
        // 32-bit unsigned integer, which is the code point.
        if (is_array($result = unpack('N', iconv('UTF-8', 'UCS-4BE', $string))) === true)
        {
            return $result[1];
        }
        return false;
    }
}

var_dump(mb_ord('漢'), mb_ord('字')); // 28450, 23383
var_dump(mb_ord('汉'), mb_ord('字')); // 27721, 23383
This might be a good place to start building the LUTT:
Simplified/Traditional Chinese Characters List
I found this other linked answer, which seems to agree (to some degree) with my reasoning:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc.).
Related
I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese, … languages? And if replacing is not possible because it's not easy to determine replacements, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-Latin" characters. But maybe there is another good solution for that?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was in English, German, and Russian at first. Later on, some Chinese pages came along. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ASCII' characters and possibly returned 'blank' (invalid) clean URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to the attempt of replacing those characters, which, as stated in the question and confirmed in the comments, is of course not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this isn't an issue anymore; I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is not a single romanization method; these languages usually have different ones which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be to detect which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations, and as such romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing such an amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be way easier to simply filter those out and otherwise support Unicode file names, as in the sketch below.
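A minimal sketch of that approach in PHP, filtering only the truly invalid characters and keeping everything else, Unicode included:

// Replace only the characters that are invalid on common file systems
// (* | \ / : " < > ?) and keep all other Unicode characters as-is.
function sanitize_filename($name)
{
    return preg_replace('~[*|\\\\/:"<>?]~', '_', $name);
}

echo sanitize_filename('漢字: report?.txt'); // 漢字_ report_.txt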
You could run it through your existing sanitizer, then convert anything non-Latin to punycode.
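If the intl extension is available, PHP's idn_to_ascii() produces exactly that kind of punycode ("xn--…") label. A sketch of the two-step idea, where my_existing_sanitizer() is a hypothetical stand-in for the sanitizer from the question:

// Sanitize first, then punycode-encode anything that is still non-ASCII
// (requires ext-intl; my_existing_sanitizer() is hypothetical).
function make_slug($text)
{
    $slug = my_existing_sanitizer($text);
    if (preg_match('~[^\x00-\x7F]~', $slug))
    {
        $slug = idn_to_ascii($slug, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    }
    return $slug;
}

echo make_slug('汉字'); // an ASCII "xn--..." label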
So, as I understand it, you need a character relation table for each language, and you replace characters according to that table. For example, to transliterate Russian characters into Latin equivalents, we use such tables (or classes that use these tables).
Interesting, I just found this: http://derickrethans.nl/projects.html#translit
In order to make a PHP content management system extensible, language translations are crucial. I was researching programming approaches for a translation system, and I thought that Qt Linguist was a good example.
This is an example usage from the Qt documentation:
int n = messages.count();
showMessage(tr("%n message(s) saved", "", n));
Qt uses known language rules to determine whether "message" has an "s" appended in English.
When I brought that example up with my development team, they discovered an issue that jeopardizes the effectiveness of modeling an extensible system on Qt's tr() function.
This is a similar example, except that something is now seriously wrong.
int n = deadBacteria.count();
showMessage(tr("%n bacterium(s) killed", "", n));
The plural of "bacterium" is "bacteria". It is improper to append an "s".
I don't have much experience with Qt Linguist, but I haven't seen how it handles irregular conjugations and forms.
A more complicated phrase could be "%n cactus(s) have grown." The plural should be "cacti", and "have" needs to be conjugated to "has" if there is one cactus.
You might think that the logical correction is to avoid these irregular words because they are not used in programming. Well, that is not helpful, for two reasons:
Perhaps there is a language that modifies nouns in an irregular way, even though the source string works in English, like "%n message(s) saved". In MyImaginaryLanguage, the proper way to form the translated string could be "1Message saved", "M2essage saved", "Me3ssage saved" for %n values 1, 2, and 3, respectively, and it doesn't look like Qt Linguist has rules to handle this.
To make a CMS extensible like I need mine to be, all types of web applications need to be factored in. Somebody may build a role-playing game that requires sentences to be constructed like "5 cacti have grown." Or maybe security software wants to say "ClamAV found 2 viruses." as opposed to "ClamAV found 2 virus(es)."
After searching online to see if other Qt developers have a solution to this problem and not finding any, I came to Stack Overflow.
I want to know:
What extensible and effective programming technique should be used to translate strings with possible irregular rules?
What do Qt programmers and translators do if they encounter this irregularity issue?
You've misunderstood how the pluralisation in Qt works: it's not an automatic translation.
Basically, you have a default string, e.g. "%n cactus(s) have grown.", which is a literal in your code. You can put whatever the hell you like in it, e.g. "dingbat wibble foo %n bar".
You may then define translation languages (including one for the same language you've written the source string in).
Linguist is programmed with the various rules for how languages treat quantities of something. In English it's just singular or plural; but if a language has a specific form for zero or whatever, it presents those in Linguist. It then allows you to put in whatever the correct sentence would be, in the target translation language, and it handles putting the %n in where you decide it should be in the translated form.
So whoever does the translation in Linguist is provided with the source and has to fill in the singular and plural, for example:
Source text: %n cactus(s) have grown.
English translation (Singular): %n cactus has grown.
English translation (Plural): %n cacti have grown.
If the application can't find an installed translation, then it falls back to the source literal. Also, the source literal is what the person translating sees, so they have to infer what you meant from it. Hence "dingbat wibble foo %n bar" might not be a good idea when describing how many cacti have grown.
Further reading:
The Linguist manual
The Qt Quarterly article on Plural Form(s) in Translation(s)
The Internationalization example or the I18N example
Download the SDK and have a play.
Your best choice is to use the GNU gettext i18n framework. It is nicely integrated into PHP, and gives you tools to precisely define all the quirky grammar rules about plural forms.
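A minimal sketch of what that looks like in PHP (assuming ext-gettext and a compiled messages.mo catalog under ./locale; the domain name and paths are illustrative). The language-specific plural rule lives in the catalog's Plural-Forms header, so irregular forms like "cacti" are simply written out by the translator:

// Plural-aware translation with gettext (paths/domain are illustrative).
putenv('LC_ALL=en_US.UTF-8');
setlocale(LC_ALL, 'en_US.UTF-8');
bindtextdomain('messages', './locale');
textdomain('messages');

$n = 5;
// ngettext() selects the right form using the catalog's Plural-Forms rule.
echo sprintf(ngettext('%d cactus has grown.', '%d cacti have grown.', $n), $n);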
Using Qt Linguist you can handle the various grammatical numbers based on the target language, so every time a %n is detected in a tr() string, the translator will be asked to give all the necessary translations for the target language. Check this article for more details:
http://doc.qt.nokia.com/qq/qq19-plurals.html
Is there a way to select in MySQL words that are only Chinese, only Japanese, or only Korean?
In English it can be done by:
SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'
or even a "dirty" solution like:
SELECT * FROM table WHERE field > "0" AND field <"ZZZZZZZZ"
Is there a similar solution for eastern languages / CJK characters?
I understand that Chinese and Japanese share characters, so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.
The words are stored in a UTF-8 string field.
If this cannot be done in MySQL, can it be done in PHP?
Thanks! :)
edit 1: The data does not indicate which language each string is in; therefore, I cannot filter by another field.
edit 2: Using a translator API like Bing's (Google is closing their translator API) is an interesting idea, but I was hoping for a faster, regex-style solution.
Searching for a UTF-8 range of characters is not directly supported in MySQL regexp. See the MySQL reference for regexp, where it states:
Warning: The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.
Fortunately, in PHP you can build such a regexp, e.g. with
/[\x{1234}-\x{5678}]*/u
(note the u at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the Unicode code charts will enable you to pick the appropriate script for the language (although not directly the language itself).
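For example, using some well-known script blocks (Hiragana U+3040–U+309F, Katakana U+30A0–U+30FF, Hangul syllables U+AC00–U+D7AF, CJK Unified Ideographs U+4E00–U+9FFF), a PHP-side check could look like this; remember that these ranges identify scripts, not languages:

// Script-block tests; the /u modifier makes \x{...} match code points.
$hasKana   = preg_match('/[\x{3040}-\x{30FF}]/u', $text); // Hiragana + Katakana
$hasHangul = preg_match('/[\x{AC00}-\x{D7AF}]/u', $text); // Hangul syllables
$hasHan    = preg_match('/[\x{4E00}-\x{9FFF}]/u', $text); // CJK ideographs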
You can't do this from the character set alone, especially in modern times where Asian texts are frequently "romanized", that is, written with the Roman script. That said, if you merely want to select texts that are superficially 'Asian', there are ways of doing that, depending on just how complicated you want to be and how accurate you need to be.
But honestly, I suggest that you add a new "language" field to your database and ensure that it's populated correctly.
That said, here are some useful links you may be interested in:
Detect language from string in PHP
http://en.wikipedia.org/wiki/Hidden_Markov_model
The latter is relatively complex to implement, but yields a much better result.
Alternatively, I believe that Google has an (online) API that will allow you to detect AND translate a language.
An interesting paper that should demonstrate the futility of this exercise is:
http://xldb.lasige.di.fc.ul.pt/xldb/publications/ngram-article.pdf
Finally, you ask:
If this can't be done in MySQL, can it be done in PHP?
It is likely to be much easier to do this in PHP, because you are better able to perform mathematical analysis on the language string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching them for performance reasons.
You may consider another data structure that contains the words and/or characters, and the language you want to associate them with.
The "normal" ASCII characters will be associated with many more languages than just English, for instance, just as other characters may be associated with more than just Chinese.
Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.
Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.
Japanese and Chinese both use Han characters, though, which occupy the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!
There are some heuristics though.
Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.
Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.
But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.
One method that will work with all text is to look at groups of characters. This means n-grams and probably Markov models as Arafangion mentions in their answer. But be aware that even this is not foolproof in the case of very short strings!
And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.
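A sketch of those heuristics in PHP, using the Unicode script blocks mentioned in the other answer; as stated above, Han-only text remains ambiguous and short strings are unreliable:

// Rough script-based language guess; NOT foolproof for short strings.
function guess_cjk_language($text)
{
    if (preg_match('/[\x{3040}-\x{30FF}]/u', $text)) // kana => Japanese
    {
        return 'Japanese';
    }
    if (preg_match('/[\x{AC00}-\x{D7AF}]/u', $text)) // Hangul => Korean
    {
        return 'Korean';
    }
    if (preg_match('/[\x{4E00}-\x{9FFF}]/u', $text)) // Han alone is ambiguous
    {
        return 'Chinese or Japanese';
    }
    return 'unknown';
}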
I'm working on an I18N application which will be localized into Japanese. I don't know a word of Japanese, and I'm first wondering if UTF-8 is enough for that language.
Usually, for European languages, UTF-8 is enough: I have to set up my database charset/collation to use utf8_general_ci (in MySQL) and my HTML views in UTF-8, and that's enough.
But what about Japanese, is there something else to do?
By the way, my application should be able to handle English, French, and Japanese, but later on it may be necessary to add more languages, say, Russian.
How could I set up my I18N application to be widely usable without having to change much configuration on deployment?
Are there any best practices?
By the way, I'm planning to use gettext; I'm pretty sure it supports such languages without any problems, as it is the de facto standard for almost all GNU software. But any feedback?
A couple of points:
UTF-8 is fine for your app-internal data, but if you need to process user-supplied documents (e.g. uploads), those may use other encodings like Shift-JIS or ISO-2022-JP (see the sketch after this list)
Japanese text does not use whitespace between words. If your app needs to split text into words somewhere, you've got a problem.
Apart from text, date and number formats differ
The generic collation may not lead to a useful sort order for Japanese text; if your app involves large lists that people have to find things in, this can be a problem
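On the first point, a sketch of normalizing user-supplied Japanese text to UTF-8 with the mbstring extension; the candidate-encoding list is an assumption and should be tuned to what you actually receive:

// Normalize an uploaded document to UTF-8 (requires ext-mbstring).
// The detection order is a guess: strict UTF-8 first, then common
// Japanese legacy encodings.
function to_utf8($raw)
{
    $from = mb_detect_encoding($raw, array('UTF-8', 'SJIS-win', 'ISO-2022-JP', 'EUC-JP'), true);
    return mb_convert_encoding($raw, 'UTF-8', $from ?: 'UTF-8');
}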
Yep, Unicode contains all the code points you need to display English, French, Japanese, Russian, and pretty much any language in the world (including Taiwanese, Cherokee, Esperanto, really anything but Elvish). That's what it's for. Due to the nature of UTF-8, though, text in more esoteric languages will take a few more bytes to store.
Gettext is widely used and your PHP build probably even includes it. See http://php.net/gettext for usage details.
Just to add that interesting website to help build I18N application: http://www.i18nguy.com/
If you store text in text files, then it goes like this.
This is the main folder structure for languages:

lang/
    en/
    fr/
    jp/
    etc.
Every subfolder (en, fr, ...) contains the same files: the same variables with different values.
For example, in lang/en/links.php you would have:
class txtLinks
{
    public static $menu = "Menu";
    public static $products = "Show products";
    // ...
}

class txtErrors
{
    public static $wrongUName = "This user does not exist";
    // ...
}
Then, when a script loads, you do:

if ($lang === 'en')      // however you detect the visitor's language
{
    define('__LANG', 'en');
}
elseif ($lang === 'fr')
{
    define('__LANG', 'fr');
}
// ...

Then:

include 'lang/' . __LANG . '/links.php'; // or whatever file you want
Then this is a piece from your PHP script:

echo txtLinks::$menu; // etc.
If you go the database way, you do something analogous, where instead of files you have tables.
This way you have absolute freedom, because you can give the English files to some person who speaks, let's say, French, and they are able to fill in the French values without being required to know programming at all.
And you yourself don't care what language is later added or removed.
And if you work with MVC, you can split the language files per controller, so you don't end up loading one huge text file.
I'm a self-taught PHP programmer, so I often don't know the "correct" way to do something. I want to normalize my character encoding practices between my PHP, HTML, and MySQL data.
-- live in the US,
-- work on sites for people who speak English,
-- most foreign languages I will encounter are western (Spanish, Italian, French)
-- living near NYC, I could encounter Hebrew, Russian, etc., though I'd avoid using their character systems and would only use whatever accents are necessary to stay within Latin characters.
Anyone want to comment on which I should choose, UTF-8 or ISO-8859-1? Or something else?
Chris
IMHO it's always best to work with UTF-8.
Your preferences will not always reflect your users' preferences, who might happen to really like Hebrew or Russian. שלום!
If your application is going to support only one language, it's better to use that language's native encoding. With two or more, consider UTF-8.