I'm using po files to translate my application using the gettext function.
I have a lot of strings that use formatting characters like spaces, colons, question marks, etc.
What's best practice here?
E.g.:
_('operating database: '). DB_NAME. _(' on ').DB_HOST;
_('Your name:');
or
_('operating database').': '. DB_NAME.' '._('on').' '.DB_HOST;
_('Your name').':';
Should I keep them in the translation, or is it better to hardcode them? What are the pros and cons?
Neither of your examples is good.
The best practice is to have one string per one self-contained displayed unit of text. If you're showing a message box, for example, then all of its content should be one translatable string, even if it has more than one sentence. A label: one string; a message: one string.
Never, unless you absolutely cannot avoid it, break a displayed piece of text into multiple strings concatenated in code, as the above examples do. Instead, use string formatting:
sprintf(_('operating database: %s on %s'), $DB_NAME, $DB_HOST);
The reason is that a) some translations may need to put the arguments in different order and b) it gives the translator some context to work with. For example, "on" alone can be translated quite differently in different sentences, even in different uses in your code, so letting the translator translate just that word would inevitably lead to poor, hard to understand, broken translations.
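If a translation needs the arguments in a different order, sprintf/printf positional placeholders make that possible. A small sketch reusing the variables from the example above (the German msgstr is only a made-up illustration):
<?php
// %1$s and %2$s refer to the first and second argument explicitly,
// so a translator may reorder them freely in the translated string.
printf(_('operating database: %1$s on %2$s'), $DB_NAME, $DB_HOST);

// In the .po file:
// msgid  "operating database: %1$s on %2$s"
// msgstr "verwendete Datenbank: %1$s auf %2$s"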
The GNU gettext manual has a chapter on this as well.
If you keep them in the translation, then every translation will duplicate them. This means all these spaces, colons, etc. will be duplicated for each language. What for?
I'm in favor of translating just the meaningful parts of the strings (the second variant).
Related
I'm starting to use poedit for PHP text translations. Now I'm confused what to do with long multiline texts, like a page with Terms and Conditions. I'm having a discussion about this with a colleague.
I see two options:
Per paragraph or line one poedit field
This keeps the text inside an echo _('lorem ipsum'); small.
Disadvantage: if you change the text and need to add lines, you would have to add lines in the code, which is not desirable. In practice you probably wouldn't do this, and would instead use one identifier for two paragraphs, which goes against this method.
All text in one poedit field
Disadvantage: This would result in a very long identifier. I'm not going to give an example - you know how long they can be. The identifier text could span more than a screen.
Identifier to describe what should be displayed
The identifier could be replaced by a short line describing what it is about, like this:
<?php _('Terms and conditions: complete text'); ?>
Disadvantage: if a new language is added, and this text is not translated, the identifier will show.
Advantage: if paragraphs are added or removed, you can do so without having to change the source code.
Advantage: sometimes one English word is translated into different meanings depending on context, like "bold":
Font weight: bold - translates to "fett" in German.
How are you feeling today: bold - translates to "mutig" in German.
NB: you should read this as a form field where you can select what font-weight you want: (1) normal, (2) bold, or (3) italic, not as one text "Font weight: bold".
In these two cases an identifier like "bold-font" and "bold-feeling" would provide context.
Questions
How do you handle long texts in poedit? What is best practice?
Gettext is not suitable for long free form texts at all. The correct answer is a third one you didn’t mention: just don’t. Use e.g. localized text files instead.
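For illustration, a minimal sketch of the localized-text-file idea, assuming the current locale is available as $locale and the texts live in templates/terms/<locale>.html (the paths and names here are made up):
<?php
function load_long_text($name, $locale, $fallback = 'en') {
    $path = __DIR__ . "/templates/$name/$locale.html";
    if (!is_file($path)) {
        // Fall back to the source language if no translation exists yet.
        $path = __DIR__ . "/templates/$name/$fallback.html";
    }
    return file_get_contents($path);
}

echo load_long_text('terms', $locale);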
To comment on some of your ideas:
Per paragraph or line one poedit field
Chopping the text into individual lines is an extremely bad idea. It would lose important context.
Identifier to describe what should be displayed
...
Advantage: if paragraphs are added or removed, you can do so without having to change the source code.
Very bad idea as well (generally, not just for long texts). If the source (English) text changes, there's nothing to alert you to the need to update the translation. Inevitably, the translations will get out of sync, undetectably. This is not an "advantage" at all.
There’s a reason why gettext uses source text as the key: it works.
Advantage: sometimes one English word is translated into different meanings depending on context, like "bold":
I don't see how different meanings of the same words/short texts are related to the question at all. That's handled naturally by gettext with message contexts (msgctxt), as discussed in the manual.
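For illustration, plain PHP gettext has no built-in pgettext(), but a context can still be looked up, because an entry with a context is keyed as "context\x04msgid" in the compiled catalog. A rough sketch (frameworks often ship their own helper instead):
<?php
function pgettext($context, $msgid) {
    // \x04 (EOT) separates msgctxt and msgid in the MO file lookup key.
    $key = $context . "\x04" . $msgid;
    $translation = gettext($key);
    // gettext() returns the key unchanged when no translation exists.
    return ($translation === $key) ? $msgid : $translation;
}

echo pgettext('font weight', 'bold'); // "fett" in a German catalog
echo pgettext('feeling', 'bold');     // "mutig" in a German catalog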
I have been working on web development for quite some time now, and I have always struggled to find a clean solution for a problem I keep encountering during i18n of HTML strings, mostly anchor tags.
First off, let me show you a typical problematic example. This is a frequently encountered string in HTML templates:
Welcome to my site. Check out our cool <a href="/products">products</a> you should not miss.
How do I translate this string while still having the following properties:
Dynamic generation of the URL (e.g. using a router)
A translatable string that is as readable as possible (so translators can do it w/o looking at the code)
Because the string contains HTML, I probably want to escape some parts I insert (e.g. the URL), so I don't make myself vulnerable to XSS if this URL contains user input
It should look as good as possible in the code as well
How do you translate your strings when they contain dynamic content and HTML?
When I want to apply i18n to that string, I probably turn to gettext or a framework function. Since I come from the PHP/Joomla! world, I used JText::_ before, which acts very similarly to gettext. In Python I now use Babel. Both share the same problem, and probably many other languages do, too. All the code I share here is my way of doing it in Python, more specifically in my Mako templates.
Of course, the problem is: there is HTML in the string to be translated (and a URL, for that matter). Here are my options, each of which I will explain afterwards:
Passing the raw string to gettext
Splitting the text into three bits
Surrounding linked word with variables
Using one variable that gets built separately
Passing the raw string to gettext
This seems like the first approach one might take if one is not aware of the implications.
Approach 1:
_('Welcome to my site. Check out our cool <a href="/products">products</a> \
you should not miss.')
For this msgid you could now translate it, keeping the HTML intact.
Advantages:
This looks very clean in the code and is easy to understand
If the translator is keeping the HTML intact this does not produce any problems
Disadvantages:
The translator has to know at least a little HTML
The string is completely inflexible, e.g. if the URL changes, all translations have to be adjusted
It does not allow for dynamic generation of the URL using something like a router
So as a conclusion, while I used this I quickly hit my limit. My next idea was:
Splitting the text into three bits
Approach 2:
_('Welcome to my site. Check out our cool ') + '<a href="/products">' +\
_('products') + '</a>' + _(' you should not miss.')
Advantages:
The URL is completely flexible now
Only actual text for the translators
Disadvantages:
Splits a sentence into three parts
The translator has to know which parts belong together, or they might not be able to produce meaningful sentences
Not very pretty in code
The msgid may be a single word, which can cause problems (beware of contexts) but can be fixed.
I used this technique for some time because I did not know about printf style strings in PHP (which I used back then). Because this looked so ugly, I tried a different approach:
Surrounding linked word with variables
Approach 3:
_('Welcome to my site. Check out our cool %sproducts%s you should not miss.') % \
    ('<a href="/products">', '</a>')
Advantages:
Single string to translate, a complete sentence
Translator gets the context right from the string
Code is not that ugly
Disadvantages:
The translator has to take care that no %s goes missing (it might be confusing, as it reads like %sproducts%s)
Introduces two format string variables for every URL, one being only </a>
Using one variable that gets built separately
From here I tried some different approaches, but I finally came up with the one I currently use (which might look like overkill, but I prefer it for now).
Approach 4:
_('Welcome to my site. Check out our cool %s \
you should not miss.') % ('<a href="%s">%s</a>' % ('/products', _('products')))
Let me take some time to explain this (seemingly lunatic) approach. First of all, the actual translation string looks like this:
_('Welcome to my site. Check out our cool ${product_url} \
you should not miss.')
This leaves the translator with information about what is inserted there (that's the translationstring version). Second, I want to ensure that I can manually escape all parts that are inserted into the HTML. While Mako provides automatic escaping, this does not make sense in a statement like this:
${'<a href="/products">This is a url</a>'}
It would destroy the URL, so I have to apply the |n filter to remove any escaping. However, if any argument of that is user-supplied, it also opens up XSS, which I want to prevent. Not taking any risk, I can just escape any input (the same way good template engines do by default) and then remove Mako's escaping for this one string. So
'<a href="%s">%s</a>' % ('/products', _('products'))
actually looks like
'<a href="%s">%s</a>' % (escape('/products'), _('products'))
where escape is imported from markupsafe (see Markupsafe).
The final part now is dynamic URLs through a router: request.route_url('products_view')
To combine all of these possibilities, I have to produce something very ugly (note that this uses the mapping keyword argument of translationstring (translationstring.TranslationString)), but it combines all the benefits I want/need from translation:
Final result:
_('Welcome to my site. Check out our cool ${product_url} \
you should not miss.', mapping={'product_url': '<a href="%s">%s</a>' %\
(escape(request.route_url('products_view')), _('products'))})
Advantages:
Full HTML escaping
Fully dynamic
Very good msgids for translation
Disadvantages:
An extremely ugly construct in the template (or the program anyway)
The lingua extractor doesn't catch _('products'), so we have to extract that manually
So that is it; this concludes my approaches to this problem. Maybe I am doing something way too complicated and you have much better ideas, or maybe this is a problem that depends on the specific type of translatable text (and one has to choose the right approach).
Did I miss any solution or anything that would improve my approach?
I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese and similar languages? And if replacing is not possible because it's not easy to determine, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-latin" characters. But maybe there is another good solution to that?
Edit/addition
As asked in the comments: What is the purpose of my question? We had a client whose content was in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all non-ASCII characters and possibly returned 'blank' (invalid) clean URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to the attempt to replace those characters, which is of course, as stated in the question and confirmed in the comments, not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this is no longer an issue. I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is not a single romanization method; those languages usually have different ones which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and thus romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be way easier to simply filter those out and otherwise support Unicode file names.
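A rough sketch of that direction; the character list and the length limit are assumptions, not a complete cross-platform rule set:
<?php
function sanitize_filename($name) {
    // Strip characters that are invalid on common file systems,
    // plus ASCII control characters, but keep all other Unicode.
    $name = preg_replace('/[*|\\\\\/:"<>?\x00-\x1F]/u', '', $name);
    // Trim leading/trailing spaces and dots, which Windows dislikes.
    $name = trim($name, ' .');
    // Keep the name reasonably short.
    return mb_substr($name, 0, 200, 'UTF-8');
}

echo sanitize_filename('日本語: テスト?.txt'); // 日本語 テスト.txt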
You could run it through your existing sanitizer and then convert anything non-Latin to Punycode.
So, as I understand it, you need character mapping tables for every language and then replace characters according to those tables.
For example, to transliterate Russian characters to Latin equivalents, we use such tables (or classes that use these tables).
Interestingly, I just found this: http://derickrethans.nl/projects.html#translit
Is there a way to detect hard coded label text that potentially needs to be replaced by a label within a PHP application? I am not just talking about PHP files but also about javascript, xml files and SMARTY/TWIG templates. Are there standard procedures within multilingual applications?
For PHP you could iterate over the template files using token_get_all().
You'd look at the string literal tokens (T_CONSTANT_ENCAPSED_STRING) and then check whether they are not in the same format as your placeholders. For example: "All Uppercase" or something like that.
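A rough sketch of that idea; the function name is made up, and the "previous token" check is a simplification that will miss some constructs:
<?php
function find_hardcoded_strings($file) {
    $tokens = token_get_all(file_get_contents($file));
    $hits = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }
        // Walk back over '(' and whitespace to see whether the literal
        // is the argument of a translation call such as _() or gettext().
        $prev = $i;
        do {
            $prev--;
        } while ($prev >= 0 && ($tokens[$prev] === '(' ||
            (is_array($tokens[$prev]) && $tokens[$prev][0] === T_WHITESPACE)));
        $translated = $prev >= 0 && is_array($tokens[$prev]) &&
            $tokens[$prev][0] === T_STRING &&
            in_array($tokens[$prev][1], ['_', 'gettext'], true);
        if (!$translated) {
            $hits[] = [$token[2], $token[1]]; // line number and literal
        }
    }
    return $hits;
}

foreach (find_hardcoded_strings('template.php') as [$line, $text]) {
    echo "$line: $text\n";
}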
For xml it's pretty much the same deal, iterating over the nodes and checking if any text content is flying around where you would only expect placeholders.
Our Search Engine is a tool for efficient searching across large code bases, indexing the language lexical structure to speed the search. It is thus faster than grep, and allows much more nuanced queries in terms of those language lexemes.
A query is in the form of a series of lexemes with various constraints. One might write a query:
I=*foo* '.' S=*hello
meaning: "find an Identifier containing 'foo', followed by a concatenation operator, followed by a literal String having the letters 'hello' at the end." For PHP, the generic lexeme S represents all the string-type literals (single-quoted strings, double-quoted strings, heredocs, etc.; you can search for them specifically if you want). Because the search engine understands lexical syntax, it won't get confused by intervening whitespace, line breaks or comments, so you don't have to know the layout to find it. (It will find comment tokens with constraints if you insist.)
You don't have to put a constraint:
I=*foo* '.' S
finds any identifier dot string combination.
The query
S
by itself directly answers the OP's question of "where are literal strings?" of any type.
Is there a way to select in mysql words that are only Chinese, only Japanese and only Korean?
In English it can be done by:
SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'
or even a "dirty" solution like:
SELECT * FROM table WHERE field > "0" AND field <"ZZZZZZZZ"
Is there a similar solution for eastern languages / CJK characters?
I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.
The words are stored in a utf-8 string field.
If this cannot be done in mysql, can it be done in PHP?
Thanks! :)
edit 1: The data does not include which language the string is in, therefore I cannot filter by another field.
edit 2: Using a translator API like Bing's (Google is closing their translator API) is an interesting idea, but I was hoping for a faster, regex-style solution.
Searching for a UTF-8 range of characters is not directly supported in MySQL regexp. See the MySQL reference for REGEXP, where it states:
Warning: The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.
Fortunately in PHP you can build such a regexp e.g. with
/[\x{1234}-\x{5678}]*/u
(note the u at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the Unicode code charts will enable you to pick the appropriate script for the language (although not directly the language itself).
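For example, a sketch of such checks in PHP; PCRE with the u modifier also understands Unicode script properties such as \p{Han}, \p{Hiragana}, \p{Katakana} and \p{Hangul}, which saves looking up the numeric ranges:
<?php
$text = '日本語のテキスト'; // sample input

// Explicit ranges: Hiragana U+3040-U+309F and Katakana U+30A0-U+30FF,
// CJK Unified Ideographs U+4E00-U+9FFF, Hangul syllables U+AC00-U+D7AF.
$hasKana   = preg_match('/[\x{3040}-\x{30FF}]/u', $text);
$hasHan    = preg_match('/[\x{4E00}-\x{9FFF}]/u', $text);  // or '/\p{Han}/u'
$hasHangul = preg_match('/[\x{AC00}-\x{D7AF}]/u', $text);  // or '/\p{Hangul}/u'

var_dump($hasKana, $hasHan, $hasHangul); // int(1), int(1), int(0)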
You can't do this from the character set alone, especially in modern times where Asian texts are frequently "romanized", that is, written with the Roman script. That said, if you merely want to select texts that are superficially 'Asian', there are ways of doing that, depending on just how complicated you want to be and how accurate you need to be.
But honestly, I suggest that you add a new "language" field to your database and ensure that it's populated correctly.
That said, here are some useful links you may be interested in:
Detect language from string in PHP
http://en.wikipedia.org/wiki/Hidden_Markov_model
The latter is relatively complex to implement, but yields a much better result.
Alternatively, I believe that Google has an (online) API that will allow you to detect, AND translate, a language.
An interesting paper that should demonstrate the futility of this exercise is:
http://xldb.lasige.di.fc.ul.pt/xldb/publications/ngram-article.pdf
Finally, you ask:
If this can't be done in MySQL, can it be done in PHP?
It will likely be much easier to do this in PHP, because you are more able to perform mathematical analysis on the language string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching them for performance reasons.
You may consider another data structure that contains the words and/or characters, and the language you want to associate them with.
The 'normal' ASCII characters will be associated with many more languages than just English, for instance, just as other characters may be associated with more than just Chinese.
Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.
Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.
Japanese and Chinese both use Han characters though which means the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!
There are some heuristics though.
Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.
Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.
But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.
One method that will work with all text is to look at groups of characters. This means n-grams and probably Markov models as Arafangion mentions in their answer. But be aware that even this is not foolproof in the case of very short strings!
And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.
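For what it's worth, a rough PHP sketch of the heuristic described above: kana implies Japanese, Hangul implies Korean, and Han alone is treated as Chinese, with all the caveats about short or kanji-only strings still applying:
<?php
function guess_cjk_language($text) {
    if (preg_match('/[\p{Hiragana}\p{Katakana}]/u', $text)) {
        return 'ja'; // kana is essentially unique to Japanese
    }
    if (preg_match('/\p{Hangul}/u', $text)) {
        return 'ko'; // Hangul is essentially unique to Korean
    }
    if (preg_match('/\p{Han}/u', $text)) {
        return 'zh'; // could equally be Japanese written only in kanji
    }
    return null;     // no CJK characters found
}

echo guess_cjk_language('ひらがなとカタカナ'); // ja
echo guess_cjk_language('안녕하세요');         // ko
echo guess_cjk_language('中文');               // zh (ambiguous for short strings)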