In order to make a PHP content management system extensible, language translations are crucial. I was researching programming approaches for a translations system, and I thought that Qt Linguist was a good example.
This is an example usage from the Qt documentation:
int n = messages.count();
showMessage(tr("%n message(s) saved", "", n));
Qt uses known language rules to determine whether "message" has an "s" appended in English.
When I brought that example up with my development team, they discovered an issue that jeopardizes the effectiveness of modeling our extensibility on Qt's tr() function.
This is a similar example, except that something is now seriously wrong.
int n = deadBacteria.count();
showMessage(tr("%n bacterium(s) killed", "", n));
The plural of "bacterium" is "bacteria". It is improper to append an "s".
I don't have much experience with Qt Linguist, and I haven't seen how it handles irregular plurals and other irregular forms.
A more complicated phrase could be "%n cactus(s) have grown.". The plural should be "cacti", and "have" needs to be conjugated to "has" if there is one cactus.
You might think that the logical correction is to avoid these irregular words because they are uncommon in programming. Well, that is unhelpful in two ways:
Perhaps there is a language that modifies nouns in an irregular way, even though the source string works in English, like "%n message(s) saved". In MyImaginaryLanguage, the proper way to form the translated string could be "1Message saved", "M2essage saved", "Me3ssage saved" for %n values 1, 2, and 3, respectively, and it doesn't look like Qt Linguist has rules to handle this.
To make a CMS as extensible as I need mine to be, all types of web applications need to be factored in. Somebody may build a role-playing game that requires sentences to be constructed like "5 cacti have grown." Or maybe some security software wants to say "ClamAV found 2 viruses." as opposed to "ClamAV found 2 virus(es)."
After searching online to see if other Qt developers have a solution to this problem and not finding any, I came to Stack Overflow.
I want to know:
What extensible and effective programming technique should be used to translate strings with possible irregular rules?
What do Qt programmers and translators do if they encounter this irregularity issue?
You've misunderstood how the pluralisation in Qt works: it's not an automatic translation.
Basically you have a default string e.g. "%n cactus(s) have grown." which is a literal, in your code. You can put whatever the hell you like in it e.g. "dingbat wibble foo %n bar".
You may then define translation languages (including one for the same language you've written the source string in).
Linguist is programmed with the various rules for how languages treat quantities of something. In English it's just singular or plural; but if a language has a specific form for zero or whatever, it presents those in Linguist. It then allows you to put in whatever the correct sentence would be, in the target translation language, and it handles putting the %n in where you decide it should be in the translated form.
So whoever does the translation in Linguist would be provided the source, and has to fill in the singular and plural, for example.
Source text: %n cactus(s) have grown.
English translation (Singular): %n cactus has grown.
English translation (Plural): %n cacti have grown.
If the application can't find an installed translation, it falls back to the source literal. The source literal is also what the translator sees, so they have to infer what you meant from it. Hence "dingbat wibble foo %n bar" might not be a good idea when describing how many cacti have grown.
Further reading:
The Linguist manual
The Qt Quarterly article on Plural Form(s) in Translation(s)
The Internationalization example or the I18N example
Download the SDK and have a play.
Your best choice is to use the GNU gettext i18n framework. It is nicely integrated into PHP, and gives you tools to precisely define all the quirky grammar rules about plural forms.
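To illustrate, gettext catalogs carry a Plural-Forms header per language (e.g. "nplurals=2; plural=(n != 1);" for English), and the runtime evaluates that rule to pick the right translated string for a given count. Here is a minimal pure-PHP sketch of that mechanism; pickForm() and the form arrays are my own illustrative names, not part of the gettext API:

```php
<?php
// Sketch of how a gettext Plural-Forms rule selects a translation.
// The two rules below mirror real Plural-Forms headers; pickForm() and
// the $forms arrays are illustrative names, not part of the gettext API.

// English header: "nplurals=2; plural=(n != 1);"
$englishRule = function (int $n): int {
    return $n != 1 ? 1 : 0;
};

// Russian header: three forms, depending on the last digits of n.
$russianRule = function (int $n): int {
    if ($n % 10 == 1 && $n % 100 != 11) {
        return 0;                       // 1, 21, 31, ... items
    }
    if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
        return 1;                       // 2-4, 22-24, ... items
    }
    return 2;                           // 0, 5-20, 25-30, ... items
};

// Pick the right form for n and substitute the count.
function pickForm(array $forms, callable $rule, int $n): string {
    return sprintf($forms[$rule($n)], $n);
}

$en = ['%d bacterium killed', '%d bacteria killed'];
echo pickForm($en, $englishRule, 1), "\n"; // 1 bacterium killed
echo pickForm($en, $englishRule, 5), "\n"; // 5 bacteria killed
```

Whoever translates the catalog fills in one string per form, so the code itself never hardcodes the grammar.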
Using Qt Linguist you can handle the various grammatical numbers based on the target language. Every time a %n is detected in a tr() string, the translator will be asked to provide all the necessary translations for the target language. Check this article for more details:
http://doc.qt.nokia.com/qq/qq19-plurals.html
Related
I know this question is a bit vague, and I'm not sure this is even possible. On my web site I want to display a combo box with as many languages as possible (those available in Unicode), and when the user selects a language, the character map for that language should be loaded. Then users can click characters to complete the given text area with comments in their own language. I am not asking for the code, but a kind guideline about the feasibility of this and a way to do it would be really helpful.
My ultimate need is to let users type in any language of their choice. Do users need to install the language on their computer before using it? Thank you.
The Unicode Standard does not divide characters by language, and there is no rigorous definition for the concept “characters used in a language”. For example, is “é” a character used in English? (Think about “fiancé”.) What about “è”? (Think about the spelling “belovèd” used in some forms of writing.)
The Unicode Consortium has created the CLDR database, which contains information about “exemplar characters” for many languages, but these are based on subjective judgement and are often debatable – mostly in the sense of covering too much, which might not be serious here. The data is in an XML format, so it could be automatically fed into an application.
There is nothing the user needs to do, or could do, to “install the language” for purposes like this. What matters is whether the user’s computer has fonts containing all the characters needed and whether the browser is able to use them.
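As a sketch of the "automatically fed into an application" part: CLDR's per-language files are LDML XML with an exemplarCharacters element. The fragment below is a tiny made-up sample in that shape (a real file, e.g. common/main/nb.xml, is far larger and richer), parsed with core PHP's SimpleXML:

```php
<?php
// A tiny, made-up fragment in the shape of a CLDR LDML file; real files
// are much larger. SimpleXML is part of core PHP.
$ldml = <<<XML
<ldml>
  <characters>
    <exemplarCharacters>[a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å]</exemplarCharacters>
  </characters>
</ldml>
XML;

$doc = simplexml_load_string($ldml);
$set = (string) $doc->characters->exemplarCharacters;

// Strip the brackets and split on spaces to get individual characters.
$chars = explode(' ', trim($set, '[]'));

echo implode(' ', array_slice($chars, -3)), "\n"; // the three extra letters
```

A character-map UI could be populated from exactly this kind of list, with the caveats about subjectivity noted above.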
I want to implement spell correction for foreign languages like French and Russian, mostly in JavaScript/PHP. For an English spell checker, I can use an edit-distance algorithm to retrieve words from an English dictionary (the dictionary is constructed using a trie) and return the highest-frequency words. I also found articles on this, e.g. http://stevehanov.ca/blog/index.php?id=114. I think the same approach can be useful for foreign languages.
I believe there must be APIs provided for different languages, but I don't want to introduce external API dependencies into my application. Can someone suggest a direction, or link to any previous work done in this area? I read Peter Norvig's blog on a Python implementation of a spell checker, but that one is for English.
Hunspell is likely the most famous spell checker around:
http://hunspell.sourceforge.net/
There are multiple versions of Hunspell around; Aspell is an alternative:
http://aspell.net/
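If you prefer no external dependency at all, the edit-distance core of the question can be done with PHP's built-in levenshtein(). A minimal sketch, where suggest() and the word list are mine; note that levenshtein() is byte-based, so accented or Cyrillic words would need a multibyte-aware distance function:

```php
<?php
// Minimal sketch of edit-distance suggestion over a small word list.
// suggest() and $dictionary are illustrative; levenshtein() is core PHP
// but byte-based, so non-ASCII languages need a multibyte variant.
function suggest(string $word, array $dictionary, int $maxDistance = 2): array {
    $candidates = [];
    foreach ($dictionary as $entry) {
        $d = levenshtein($word, $entry);
        if ($d <= $maxDistance) {
            $candidates[$entry] = $d;
        }
    }
    asort($candidates);   // closest matches first
    return array_keys($candidates);
}

$dictionary = ['bonjour', 'bonsoir', 'merci', 'madame'];
print_r(suggest('bonjuor', $dictionary)); // finds "bonjour"
```

For real use you would combine this with the frequency ranking the question mentions, as in Norvig's approach.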
To quote another question that led me here:
What do the double underscores in these lines of PHP code mean?
$WPLD_Trans['Yes']=__('Yes',$WPLD_Domain);
$WPLD_Trans['No']=__('No',$WPLD_Domain);
and related questions about the usage of __(), _() etc. in WordPress and elsewhere.
This started as an answer to the question mentioned above and other related ones, but I post it as its own question (with answer) since it became a bit more detailed.
Please feel free to edit and improve – or enter better Answer / Question.
The usage of __(…)
Double underscore is used by various implementations; WordPress (WP) has been mentioned, phpMyAdmin (PMA) is another, and so on.
It reflects the native PHP function gettext(), which in PHP has a single-underscore alias: _(). This is not unique to PHP: gettext is a long-standing GNU project for writing multilingual programs on Unix-like systems, and we find the _() alias in various programming languages.
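For a sense of what such a custom __() typically boils down to, here is an illustrative sketch: a per-domain lookup table with fallback to the source string. This is not WordPress's actual implementation (its __() delegates to internal translate() machinery backed by gettext-style catalogs), and the domain name is made up:

```php
<?php
// Sketch of what a custom __() wrapper typically does: look the string up
// in a per-domain translation table and fall back to the original text.
// $translations and the domain name are illustrative, not WordPress code.
$translations = [
    'wp-language-domain' => ['Yes' => 'Ja', 'No' => 'Nei'],
];

function __(string $text, string $domain = 'default'): string {
    global $translations;
    return $translations[$domain][$text] ?? $text;
}

echo __('Yes', 'wp-language-domain'), "\n";   // Ja
echo __('Maybe', 'wp-language-domain'), "\n"; // falls back to "Maybe"
```

The fallback behaviour mirrors gettext itself: an untranslated string passes through unchanged.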
Reflection on the subject
In one's humble/honest opinion, one could say that _() as the native alias and __() as a custom implementation of gettext (or of other locale-specific machinery) are the only valid uses, as that is a well-established convention. On the other hand it is not ideal, however concise it may be. E.g. in the ANSI C standard the double underscore __ is reserved for the compiler's internal use; and as closely related to C as PHP is, it is in general not the best thing to mix up conventions even across languages (IMHO) – it is on the border/thin ice/etc.
(One issue here is the choice of UnderscoreJS to use _ as the foundation of something that has nothing to do with language support.) Mix that in with prototype.js's $ and $$, and one might understand why someone scratches their head when seeing __() – especially taking into account how close PHP and JS are in the implementation of a working system.
Locale support
In this step I make some notes on the usage: locale support, with weight on i18n and l10n.
It is often used, as with WP and the (well-documented) PMA, to provide language support.
E.g.:
__('Error'); would yield Feil in Norwegian or Fehler in German. But one important thing to take note of here is that, in the same realm, we have more than direct translation: there are also other quirks one finds across differing locales/cultures/languages.
E.g. (to specify):
Some languages use a comma – , – as the fractional separator; others use a dot – . – for the same purpose. Other typical locale variations are centimetres vs. inches, date format, 12- vs. 24-hour clock, currency, etc. So:
        NUMBER    LEN        TIME
LANG1:  3,133.44  2.00 inch  5:00 PM
LANG2:  3.133,44  5,08 cm    17:00
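The number row of that table can be reproduced with core PHP's number_format(), whose third and fourth arguments are the decimal and thousands separators (the intl extension's NumberFormatter automates this per locale, but the plain function shows the point without extra extensions):

```php
<?php
// Same value, two locale conventions, using core PHP only.
$n = 3133.44;

$lang1 = number_format($n, 2, '.', ','); // e.g. en_US style
$lang2 = number_format($n, 2, ',', '.'); // e.g. de_DE / nb_NO style

echo $lang1, "\n"; // 3,133.44
echo $lang2, "\n"; // 3.133,44
```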
The next layer of code you unveil will probably have some references to the rather cryptic conventions like i18n or l10n in variable names, functions and file names.
Once you know what they mean, they are quite handy, though the definitions can be somewhat blurry. In general we have:
i18n: Internationalization
l10n: Localization (Often L10n)
g11n: Globalization (Used by e.g. IBM and Sun Microsystems)
l12y: Localizability (Microsoft)
m17n: Multilingualization (Continuum between internationalization and localization)
... And so on.
aich, my head hurts.
The number refers to the number of letters in the word between the first and last letter; the letters before and after the number are the word's first and last letters. Phew. So:
i18n                     l10n
internationalization     localization
 |                  |     |        |
 +-- 18 letters --+       +-- 10 --+
i        18       n      l    10   n
(We also have more peculiar numeronyms that do not follow this convention, like G8 (Group of Eight) – but then we're outside the realm of programming. Or are we?)
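Since the scheme is mechanical, a small function can generate such numeronyms; makeNumeronym() is just an illustrative name:

```php
<?php
// Generate a numeronym: first letter + count of inner letters + last letter.
// makeNumeronym() is an illustrative name; ASCII-only (strlen counts bytes).
function makeNumeronym(string $word): string {
    if (strlen($word) < 4) {
        return $word; // too short to abbreviate meaningfully
    }
    return $word[0] . (strlen($word) - 2) . substr($word, -1);
}

echo makeNumeronym('internationalization'), "\n"; // i18n
echo makeNumeronym('localization'), "\n";         // l10n
echo makeNumeronym('globalization'), "\n";        // g11n
```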
Then to the definitions:
W3C has a rather sane definition of Internationalization (i18n), and Localization (l10n).
Internationalization, (i18n), is the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.
Localization, (l10n), refers to the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a locale).
The short, and concise, version from debian on the subject is:
Internationalization (I18N): To make a software potentially handle multiple locales.
Localization (L10N): To make a software handle a specific locale.
The use of domains
Frequently, when looking at code or reading about localization, one comes across the usage of domain. A more search-engine-friendly name would perhaps be translation domain.
(Here I'm a bit on thin ice.)
In short, one could define this as context. As a base you have a locale, e.g. Scotland. But you can further define a domain as part of a theme, a profession, etc. E.g. "Java" in programming is rather different from Java in a coffee shop; or "cow" as slang on some site about women vs. "cow" in farming.
A page giving some clue is repoze.org with:
Translation Domain
Translation Directory
A translation directory would typically be:
/path/to/your/translation_root/en_US/LC_MESSAGES/
/path/to/your/translation_root/en_GB/LC_MESSAGES/
/path/to/your/translation_root/nn_NO/LC_MESSAGES/
…
Files used with GetText
If you take a look at e.g. PMA, you will find a directory called locale. It contains various directories for the various languages. The files have the .mo extension; these are GetText-specific files. One typically starts with PO (Portable Object) files, which are plain text, and compiles them to MO (Machine Object) files, which are indexed for optimization. More about this here (same as the PMA link above).
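Putting the directory layout and the domain together: the .mo file for a given locale and domain sits at a predictable path. moPath() below is an illustrative helper (the gettext extension does this lookup internally once you call bindtextdomain()), and the domain name shown is made up:

```php
<?php
// Sketch of how a gettext .mo file is located on disk: the path is built
// from a translation root, a locale, the LC_MESSAGES category and a domain.
// moPath() is illustrative, not part of the gettext extension.
function moPath(string $root, string $locale, string $domain): string {
    return sprintf('%s/%s/LC_MESSAGES/%s.mo', rtrim($root, '/'), $locale, $domain);
}

echo moPath('/path/to/your/translation_root', 'nn_NO', 'myapp'), "\n";
// /path/to/your/translation_root/nn_NO/LC_MESSAGES/myapp.mo
```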
Human-readable, meaning the string is a real word. This is essentially a form validation. Ideally I'd like to test the 'texture' of the form responses to determine whether an actual user has filled out the form versus someone looking for form vulnerabilities – possibly using a dictionary look-up on the POSTed data and then a threshold of returned 'real words'.
I don't see anything in the PHP docs, and the Google machine isn't offering up anything, at least nothing this specific. I suspect that someone out there has written a PHP class or even a jQuery plugin that can do this. Something like so:
$string = "laiqbqi";
is_this_string_human_readable($string);
Any ideas?
This can be done using something called Markov Chains.
Essentially, they read through a large chunk of text in a given language (English, French, Russian, etc.) and determine the probability of one character being after another.
e.g. a "q" has a much lower probability of occurring after a "z" than a vowel such as "a" does.
At a lower level, this is actually implemented as a state machine.
As per Mike's comment, a PHP version of this can be found here.
For flavor, an amusing Daily WTF article on Markov Chains.
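A minimal sketch of the Markov-chain idea in PHP: train bigram (letter-pair) counts on some sample text, then score a candidate string by how many of its pairs are familiar. The function names, the tiny training corpus and the 0.5 threshold are all illustrative; a real checker would train on a large corpus and use probabilities rather than mere pair presence:

```php
<?php
// Train letter-pair (bigram) counts on a corpus, then judge a candidate
// string by the fraction of its pairs ever seen in training. All names
// and the threshold are illustrative.
function bigramCounts(string $corpus): array {
    $counts = [];
    $corpus = strtolower(preg_replace('/[^a-z]/i', '', $corpus));
    for ($i = 0; $i < strlen($corpus) - 1; $i++) {
        $pair = $corpus[$i] . $corpus[$i + 1];
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }
    return $counts;
}

function looksHumanReadable(string $word, array $counts, float $threshold = 0.5): bool {
    $word = strtolower($word);
    $seen = 0;
    $total = max(1, strlen($word) - 1);
    for ($i = 0; $i < $total; $i++) {
        if (!empty($counts[$word[$i] . $word[$i + 1]])) {
            $seen++;
        }
    }
    return $seen / $total >= $threshold; // enough familiar letter pairs?
}

$counts = bigramCounts('the quick brown fox jumps over the lazy dog and then rests in the warm sun');

var_dump(looksHumanReadable('rest', $counts));    // common pairs
var_dump(looksHumanReadable('laiqbqi', $counts)); // rare pairs
```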
If a website is localized/internationalized with a Simplified Chinese translation...
Is it possible to reliably automatically convert the text to Traditional Chinese in a high-quality way?
If so, is it going to be extremely high quality or just a good starting point for a translator to tweak?
Are there open source tools (ideally in PHP) to do such a conversion?
Is the conversion better one way vs. the other (simplified -> traditional, or vice versa)?
Short answer: No, not reliably+high quality. I wouldn't recommend automated tools unless the market isn't that important to you and you can risk certain publicly embarrassing flubs. You may find some localization firms are happier to start with a quality simplified Chinese translation and adapt it to traditional, but you may also find that many companies prefer to start with the English source.
Longer answer: There are some cases where only the glyphs are different, and they have different unicode code points. But there are also some idiomatic and vocabulary differences between the PRC and Taiwan/Hong Kong, and your quality will suffer if these aren't handled. Technical terms may be more problematic or less, depending on the era in which the terms became commonly used. Some of these issues may be caught by automated tools, but not all of them. Certainly, if you go the route of automatically converting things, make sure you get buyoff from QA teams based in each of your target markets.
Additionally, there are sociopolitical concerns as well. For example, you can use terms like "Republic of China" in Taiwan, but this will royally piss off the Chinese government if it appears in your simplified Chinese version (and sometimes your English version); if you have an actual subsidiary or partner in China, the staff may be arrested solely on the basis of subversive terminology. (This is not unique to China; Pakistan/India and Turkey have similar issues). You can get into similar trouble by referring to "Taiwan" as a "country."
As a native Hong Konger myself, I concur with #JasonTrue: don't do it. You risk angering and offending your potential users in Taiwan and Hong Kong.
BUT, if you still insist on doing so, have a look at how Wikipedia does it; here is one implementation (note license).
Is it possible to reliably automatically convert the text to Traditional Chinese in a high quality way?
Other answers have focused on the difficulties, but these are exaggerated. One thing is that a substantial portion of the characters are exactly the same. The second is that the 'simplified' forms are exactly that: simplified forms of the traditional characters, which means that mostly there is a 1-to-1 relationship between traditional and simplified characters.
If so, is it going to be extremely high quality or just a good starting point for a translator to tweak?
A few things will need tweaking.
Are there open source tools (ideally in PHP) to do such a conversion?
Not that I am aware of, though you might want to check out the Google Translate API?
Is the conversion better one way vs. the other (simplified -> traditional, or vice versa)?
A few characters lost their distinction in the simplified character set. For instance 麵 (flour) was simplified to the same character as 面 (face, side). For this reason the traditional → simplified direction would be slightly more accurate.
I'd also like to point out that traditional characters are not solely in use in Taiwan (they can be found in HK and occasionally even on the mainland).
I was able to find this and this. You need to create an account to download, though. I've never used the site myself, so I cannot vouch for it.
Fundamentally, simplified Chinese characters carry a lot of merged meanings: a single simplified character often stands for several distinct traditional ones. No programming language in the world will be able to accurately convert simplified Chinese into traditional Chinese. You will just cause confusion for your intended audience (Hong Kong, Macau, Taiwan).
A perfect example of failed translation from simplified to traditional Chinese is the character "后". In simplified Chinese it has two meanings, "behind" and "queen". When you attempt to convert it back to traditional Chinese, however, there are two character choices: 後 "behind" or 后 "queen". One funny example I came across is a translator which converted "皇后大道" (Queen's Road) to "皇後大道", which literally means Queen's Behind Road.
Unless your translation algorithm is super smart, it is bound to produce errors. So you're better off hiring a very good translator who's fluent in both types of Chinese.
Short answer: Yes, and it's easy. You can first convert it from UTF-8 to Big5, then there are lots of tools for you to convert Big5 to GBK, then you can convert GBK back to UTF-8.
I know nothing about any form of Chinese, but looking at the examples on this Wikipedia page I'm inclined to think that automatic conversion is possible, since many of the phrases seem to use the same number of characters and even some of the same characters.
I ran a quick test using a multibyte ord() function and I can't see any patterns that would allow the automatic conversion without the use of a (huge?) lookup translation table.
Traditional Chinese 漢字
Simplified Chinese 汉字
function mb_ord($string)
{
    if (is_array($result = unpack('N', iconv('UTF-8', 'UCS-4BE', $string))) === true)
    {
        return $result[1];
    }

    return false;
}
var_dump(mb_ord('漢'), mb_ord('字')); // 28450, 23383
var_dump(mb_ord('汉'), mb_ord('字')); // 27721, 23383
This might be a good place to start building the LUTT:
Simplified/Traditional Chinese Characters List
I got to this other linked answer that seems to agree (to some degree) with my reasoning:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc).