We are working on a web application in PHP and use gettext to handle strings for translation. Here is the question we are struggling with right now.
Suppose we use a short ID for a string, for example "menu-feature", and want to display it in English as "Our main Features". We can of course do that: the first English translation will be made by a person with access to the application, so then we will have the "EN" PO file ready.
But if we then send it to another translator, for example a Norwegian one, that person will again see only our IDs. How can he or she find out what exactly we had in mind with each ID?
Is our way of thinking about PO files wrong? Is there a good way to do this correctly?
Common practice is to use the whole actual string in the default language as the msgid, for exactly the reason you've outlined: translators would have a much harder time figuring out the meaning behind each of your placeholders.
That said, you can add comments to each of the ids as a hint for translators. I think those lines start with a # in the .po file. They should also show up in Poedit, Virtaal or whatever they're using.
I haven't used poedit in a while, but why don't you just copy the English file, call it Norwegian, import it in poedit and start editing?
Is there a tool to do what I'm suggesting in the question title? Consider the following scenario, where the msgid is "%s minutes left", the msgstr in English is, well, the same, and the msgstr in another Language is "bla %s bla". Someone in marketing decides that in English the string should be "%s minutes left, hurry up!". What do I do? On the one hand, I can change the msgid. On the plus side, it becomes evident that the translation in Language needs to be updated; but the string change requires a developer to update a source file, which doesn't sound like good use of the developer's time when there are so many features to implement and time is scarce. On the other hand, a translator could update the msgstr in English, so as to match the marketing person's request, but then how do I notify translators that they have to update their translation? Also, all they see is the msgid, which hasn't changed. So how do I tell them what the actual string is?
Thanks to anyone who can help.
This is a common situation and, for better or for worse, is pretty much built into Gettext's intended workflow.
The normal workflow being:
source code > extract to POT > generate PO > translate PO
An amendment to the source text would then be:
edit source > extract new POT > merge into PO > review fuzzy strings
The msgmerge command line tool allows you to merge an amended POT file with already-translated PO files. This will mark any translations that had their source language altered as Fuzzy, meaning that translators might want to check again if their translation is still accurate.
msgmerge old.po changes.pot -o new.po
The Poedit program also has this function via an "Update from POT" menu option, which I suspect may be using the msgmerge program in the background.
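To make the fuzzy-marking step concrete, here is a rough Python sketch of what such a merge does conceptually. It is not msgmerge's actual algorithm: the merge_catalog helper, the 0.6 similarity cutoff and the sample strings are all assumptions made for illustration.

```python
import difflib

def merge_catalog(old_po, new_msgids, cutoff=0.6):
    """Merge existing translations into a fresh list of msgids.

    old_po maps msgid -> msgstr. Returns msgid -> (msgstr, is_fuzzy).
    Hypothetical helper: it only imitates the idea behind msgmerge.
    """
    merged = {}
    for msgid in new_msgids:
        if msgid in old_po:
            # Unchanged source string: keep the translation as-is.
            merged[msgid] = (old_po[msgid], False)
            continue
        # Changed or new source string: look for a similar old msgid.
        close = difflib.get_close_matches(msgid, list(old_po), n=1, cutoff=cutoff)
        if close:
            # Reuse the old translation but flag it for review.
            merged[msgid] = (old_po[close[0]], True)
        else:
            # Nothing similar: the string is untranslated.
            merged[msgid] = ("", False)
    return merged

old_po = {"%s minutes left": "bla %s bla", "Save": "Lagre"}
new_msgids = ["%s minutes left, hurry up!", "Save", "Delete account"]
merged = merge_catalog(old_po, new_msgids)
for msgid, (msgstr, fuzzy) in merged.items():
    print(("#, fuzzy\n" if fuzzy else "") + f'msgid "{msgid}"\nmsgstr "{msgstr}"\n')
```

The edited string keeps its old translation but carries the fuzzy flag, which is the signal translators look for when reviewing a merged file.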
As @deceze has pointed out, using "machine IDs" separates source strings from translations and can avoid such problems. Although this is how many other localisation platforms work, it is not really how Gettext is intended to work (square peg, round hole, etc.).
Perhaps worth mentioning, though, is that you can use the "Extracted comments" feature to convey further information to translators. In the PO/POT file they are indicated with a "#. " prefix. The xgettext program extracts these from your source code, but you could also maintain them manually.
For example, if you wanted to maintain an English PO file (for your marketing people to edit) and also use machine IDs in your code, you could have something like this:
#. %s minutes left
msgid "time-nag-minutes-left"
msgstr "%s minutes left, hurry up!"
Where line 1 is your original source code comment showing the initial English, the second line is your machine ID and the third line is your marketing people editing the "translation" without any need to edit the source code again.
I'm not recommending this - it's non-standard and your translators may be used to Gettext working the way it's supposed to. It's an option though.
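As an illustration of how a tool could surface that hint to a translator, here is a small Python sketch that reads the entry above and pairs the machine ID with its "#." comment. The parse_po_entries helper is hypothetical and handles only single-line msgid/msgstr pairs; it is not a full PO parser.

```python
def parse_po_entries(po_text):
    """Parse a very small subset of the PO format.

    Handles only single-line msgid/msgstr pairs and "#." extracted comments.
    Hypothetical helper for illustration, not a full PO parser.
    """
    entries, current = [], {"comments": []}
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("#."):
            current["comments"].append(line[2:].strip())
        elif line.startswith("msgid "):
            current["msgid"] = line[len("msgid "):].strip('"')
        elif line.startswith("msgstr "):
            current["msgstr"] = line[len("msgstr "):].strip('"')
            entries.append(current)
            current = {"comments": []}
    return entries

po_text = '''
#. %s minutes left
msgid "time-nag-minutes-left"
msgstr "%s minutes left, hurry up!"
'''

entries = parse_po_entries(po_text)
for entry in entries:
    # Show a translator the hint next to the opaque ID.
    print(entry["msgid"], "->", entry["comments"], "->", entry["msgstr"])
```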
Working with gettext, you have two options:
Use "string ids" in the source and provide all the UI text in PO files. E.g.:
msgid "FRONTPAGE_SALES_CALL_TO_ACTION"
msgstr "Buy now, buy often!"
This has the advantage that you can change UI text purely by editing PO files, not needing to touch code. It has the disadvantage that your users will see nonsense in the UI for missing translations, and that the UI code is more abstract/meaningless and may make it somewhat harder to work with/prototype properly etc. It also introduces complexities in working with translators, as you mention yourself.
Use the literal UI text in your code and only supply translations for it in PO files.
In this case you should not have a PO file for your source language, where each msgid and msgstr are identical. That's just redundant.
If you're using this style (and apparently you are), then any actual change of text must happen at the source. The source code is your canonical UI; if you're changing the meaning of a message, it's actually a change in the UI so the UI code needs to be updated. You then sync that change with your PO files using xgettext (or similar tools), which propagates the change to all translation files and marks the changed string for translation (or even pulls out an existing translation if available).
You should not get into the habit of doing 2. but treating it like 1. That'll lead to a lot of confused WTF?! moments down the road.
If you want to enable "front end people" to change text without needing to touch complex source code, you need to decouple your front end code better from your backend code. This is not necessarily the task of gettext, but of a better template system or otherwise better MVC separation.
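The "nonsense in the UI" point is easy to see in a quick experiment. The sketch below uses Python's standard gettext module rather than PHP, purely for illustration; its NullTranslations class is a catalog with no translations, so it stands in for a locale whose PO file is missing. The lookup returns the msgid unchanged in that case.

```python
import gettext

# NullTranslations is the standard-library catalog with no translations at
# all, so it behaves like a locale whose PO file is missing.
missing = gettext.NullTranslations()

# Style 1: opaque string ID in the code. Users see the raw ID.
print(missing.gettext("FRONTPAGE_SALES_CALL_TO_ACTION"))

# Style 2: literal UI text in the code. Users see readable source text.
print(missing.gettext("Buy now, buy often!"))
```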
I have been working on a site for the last 2 years.
Now my client wants to make it bilingual, with English and Chinese.
Any idea what should I do for it?
You can have separate constant files, each file has a bunch of constants in it, then you go through the pages replacing actual sentences with the constants.
Have a cookie that selects the correct constants file that the user has chosen.
This sort of design is good as it allows for more languages to be added easily in the future.
It is however a pain in the ass to do.
You could go for some sort of automated translator but it won't read naturally.
I had to do this and since you are working with PHP, I recommend gettext
http://php.net/manual/en/book.gettext.php
And here's an article to get you started on it
http://mel.melaxis.com/devblog/2005/08/06/localizing-php-web-sites-using-gettext/
The benefit of gettext is that the translator never has to touch the code. Rather, the translator works by translating .po files with Poedit ( http://www.poedit.net/ ). You place those files (compiled to .mo) in your locale directories, where php-gettext reads them to replace the strings. Have fun and good luck, I got it to work pretty well.
Decent introduction article on the subject:
http://onlamp.com/pub/a/php/2002/11/28/php_i18n.html
We could really do with more info.
We are in the process of making our website international, allowing multiple languages.
I've looked into php's "gettext" however, if I understand it right, I see a big flaw:
If my webpage has let's say "Hello World" as a static text. I can put the string as <?php echo gettext("Hello World"); ?>, generate the po/mo files using a tool. Then I would give the file to a translator to work on.
A few days later we want to change the text in English to say "Hello Small World"?
Do I change the value in gettext? Do I create an english PO file and change it there?
If you change the gettext it will consider it as a new string and you'll instantly lose the current translation ...
It seems to me that gradually, the content of the php file will have old text everywhere.
Or people translating might have to be told "when you see Hello World, instead, translate Hello Small World".
I don't know, I'm getting confused.
In other programming languages, I've seen that they use keywords such as web.home.featured.HelloWorld.
What is the best way to handle translations in PHP?
Thanks
You basically asked and answered your own question; the answer might just be having a slightly better understanding of how PO files work.
Within the PO file you have a msgid and a msgstr. The msgid is the value which is replaced with the msgstr within the PHP file depending on the localization.
Now you can make those msgids anything you would like. You could very well make it:
<?php echo _("web.home.featured.HelloWorld"); ?>
And then you would never touch this string again within the source, you only edit the string through the PO files.
So basically the answer to your question is you make the gettext values identifiers for what the string should say, however the translators typically use the default language files text as the basis for conversion, not the identifier itself.
I hope this is clear.
I know an answer has been accepted, and the above answer is good. But there is another issue with using permanent machine-style keys like thing.stuff.widget when working with Gettext.
While using permanent keys is a better approach to development, Gettext is not set up for that style of working and this can complicate your workflow.
If you present a translator with a PO file populated with keys in place of source text, they may not know what the English should be. So you'd have to provide them with a second file containing source language translations for them to compare to. Not the end of the world, but more fiddly for them and not how Gettext was designed (square peg, round hole, etc.).
I think PO is perfectly fine as a file format for translations in PHP, and especially recommended if you're not working with a framework that has a good l10n module, but that doesn't mean it's good for workflow and your translation process.
I suggest you arrive at a workflow that allows your programmers to work with permanent keys, your translators work with words, and gives you a MO file out the other end. Take a look at Loco for one solution to this.
Alternatively use a different interim file format that allows the use of keys and words. TMX is one example. If you still want to use Gettext at runtime you can convert the files.
Currently, I am dealing with the same issue. The common practice with gettext is to use the English text as the key. Recently, our copy editor changed a whole bunch of English text (other languages are hardly touched), so we have to change all the source code and all the PO files.
We are switching to a neutral key. Since we already have some sites on Java, we will use the same property name format.
I am working on a large multilingual website and I am considering different approaches for making it multilingual. The possible alternatives I can think of are:
The Gettext functions with generation of .po files
One MySQL table with the translations and a unique string ID for each text
PHP-files with arrays containing the different translations with unique string IDs
As far as I have understood the Gettext functions should be most efficient, but my requirement is that it should be possible to change a text string in the original reference language (English) without the other translations of that string automatically reverting back to English just because a couple of words changed. Is this possible with Gettext?
What is the least resource demanding solution?
Is using the Gettext functions or PHP files with arrays more or less equally resource demanding?
Any other suggestions for more efficient solutions?
A few considerations:
1. Translations
Who will be doing the translations? People that are also connected to the site? A translation agency? When using Gettext you'll be working with PO (.po) files. These files contain the message ID and the message string (the translation). Example:
msgid "A string to be translated would go here"
msgstr ""
Now, this looks just fine and understandable for anyone who needs to translate this. But what happens when you use keywords, like Mike suggests, instead of full sentences? If someone needs to translate a msgid called "address_home", he or she has no clue whether this should be a header "Home address" or a full sentence. In this case, make sure to add comments to the file right before you call the gettext function, like so:
/// This is a comment that will be included in the pot file for the translators
gettext("ready_for_lost_episode");
Using xgettext --add-comments=/// when creating the .po files will add these comments. However, I don't think Gettext is meant to be used this way. Also, if you need to add comments with every text you want to display, a) you'll probably make an error at some point, b) your whole script will be filled with the texts anyway, only in comment form, and c) the comments need to be placed directly above the Gettext function, which isn't always convenient, depending on the position of the function in your code.
2. Maintenance
Once your site grows (even further) and your language files along with it, it might get pretty hard to maintain all the different translations this way. Every time you add a text, you need to create new files, send out the files to translators, receive the files back, make sure the structure is still intact (eager translators are always happy to translate the syntax as well, making the whole file unusable :)), and finish with importing the new translations. It's doable, sure, but be aware of possible problems on this end with large sites and many different languages.
Another option: combine your 2nd and 3rd alternative:
Personally, I find it more useful to manage the translation using a (simple) CMS, keeping the variables and translations in a database and exporting the relevant texts to language files yourself:
add variables to the database (e.g.: id, page, variable);
add translations to these variables (e.g.: id, varId, language, translation);
select relevant variables and translations, write them to a file;
include the relevant language file in your site;
create your own function to display a variable's text:
text('var'); or maybe something like __('faq','register','lost_password_text');
Point 3 can be as simple as selecting all the relevant variables and translations from the database, putting them in an array and writing the serialized array to a file.
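A minimal sketch of that export step, written in Python for illustration. The table and column names follow the list above; everything else is an assumption: an in-memory SQLite database stands in for MySQL, JSON stands in for a serialized PHP array, and the helper names and sample rows are made up.

```python
import json
import os
import sqlite3
import tempfile

# In-memory SQLite stands in for the MySQL database (assumption).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE variables (id INTEGER PRIMARY KEY, page TEXT, variable TEXT);
    CREATE TABLE translations (id INTEGER PRIMARY KEY, varId INTEGER,
                               language TEXT, translation TEXT);
    INSERT INTO variables VALUES (1, 'faq', 'lost_password_text');
    INSERT INTO translations VALUES (1, 1, 'en', 'Lost your password?');
    INSERT INTO translations VALUES (2, 1, 'nl', 'Wachtwoord vergeten?');
""")

def export_language_file(db, language, page, directory):
    """Write one page's translations for one language to a JSON file."""
    rows = db.execute(
        "SELECT v.variable, t.translation FROM variables v "
        "JOIN translations t ON t.varId = v.id "
        "WHERE t.language = ? AND v.page = ?",
        (language, page),
    ).fetchall()
    path = os.path.join(directory, f"{language}_{page}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(dict(rows), fh, ensure_ascii=False)
    return path

def load_texts(path):
    """Load a previously exported language file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

cache_dir = tempfile.mkdtemp()
texts = load_texts(export_language_file(db, "nl", "faq", cache_dir))
print(texts["lost_password_text"])
```

At request time only the exported file is read, so the database is touched only when translations change.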
Advantages:
Maintenance. Maintaining the texts can be a lot easier for big projects. You can group variables by page, sections or other parts within your site, by simply adding a column to your database that defines to which part of the site this variable belongs. That way you can quickly pull up a list of all the variables used in e.g. the FAQ page.
Translating. You can display the variable with all the translations of all the different languages on a single page. This might be useful for people who can translate texts into multiple languages at the same time. And it might be useful to see other translations to get a feel for the context so that the translation is as good as possible. You can also query the database to find out what has been translated and what hasn't. Maybe add timestamps to keep track of possible outdated translations.
Access. This depends on who will be translating. You can wrap the CMS with a simple login to grant access to people from a translation agency if need be, and only allow them to change certain languages or even certain parts of the site. If this isn't an option you can still output the data to a file that can be manually translated and import it later (although this might come with the same problems as mentioned before.). You can add one of the translations that's already there (English or another main language) as context for the translator.
All in all I think you'll find that you'll have a lot more control over the translations this way, especially in the long run. I can't tell you anything about speed or efficiency of this approach compared to the native gettext function. But, depending on the size of the language files, I don't think it'll be a big difference. If you group the variables by page or section, you can always include only the required parts.
After some testing I finally decided to go more or less along the lines of Alec's combination of the second and third alternative.
Gettext problem
I tried to set up the whole gettext system first to try it out, but it turned out to be much more complicated than I thought. The problem is that Windows and Unix systems use different language short names for setlocale(). For the moment I'm running my dev server on Windows with Wamp, while the final site will run on Linux. After going through a couple of dozen guides, forums, questions etc., and restarting the server after each modification, I still couldn't get it set up properly in any easy way. Additionally, gettext is not thread-safe; to update the language file the server needs to be restarted or a hack needs to be used; and there is no easy way of handling different versions of language files or of handling the original English text without modifying the source or using Mike's suggestion, which as Alec pointed out isn't optimal.
Solution
So I ended up with what I think is the best solution based on Alec's response:
Save all the translations in a DB with the fields language, page, var_key, version, revision and last_modified_time, where version corresponds to versions of the original translation (English), while revision allows the translator to modify/correct the finalized translations within a version.
Use a kind of CMS for translation, which is connected to the DB and handles different versions and allows for an easy overview of which languages are translated, in which version and how complete the translations are.
When a revision of a version is finalized, cache files are generated. Each file contains an array with only the var_key and text translation for one language and one page, and is named with the ISO 639-1 name of the language and the page name, like lang/en_index.php. These language files are then simply included and wrapped in a function t($var_key), which allows using the DB during development and switching to only the cache files afterwards.
Performance
I never got around to testing gettext, but according to the link Mike posted, the difference in performance between using an array and gettext is totally acceptable to me given the benefits of a custom system as described above. However, I compared using an array with 20 translated text strings against retrieving the same 20 text strings from a MySQL DB. It turned out that using an array included from a file was around 6 times faster than retrieving all 20 strings at once from the MySQL DB. It was not a really scientific benchmark and the results may surely vary on different systems and setups, but it clearly shows exactly what I expected: that it would be much slower using a DB than using an array directly, which is why I chose to generate cache files for the array instead of using the DB.
As a comparison I also tested how fast it was to only output simple echos with the same text. It turned out to be around 20 times faster than using arrays from an included file, but then it is not possible to translate without having different versions of the page for different languages, which defeats the purpose of dynamic pages. In that case it is better to also use a good cache system.
Performance test source files:
PHP: http://pastie.org/964082
MySQL table: http://pastie.org/964115
It is surely not perfect, but at least creates an idea about the performance differences.
Rather than being tied to the English text used as the keys, you could treat those keys as arbitrary identifiers and also provide English translations, i.e.
gettext key is 'hello'
You then have your various language translations of this and an English translation of this that is also 'hello'. Then, if you want to update the English version of the string, you can leave the key alone and just update the English translation.
We're getting ready to translate our PHP website into various languages, and the gettext support in PHP looks like the way to go.
All the tutorials I see recommend using the English text as the message ID, i.e.
gettext("Hi there!")
But is that really a good idea? Let's say someone in marketing wants to change the text to "Hi there, y'all!". Then don't you have to update all the language files because that string -- which is actually the message ID -- has changed?
Is it better to have some kind of generic ID, like "hello.message", and an English translations file?
Wow, I'm surprised that no one is advocating using the English as a key. I used this style in a couple of software projects, and IMHO it worked out pretty well. The code readability is great, and if you change an English string it becomes obvious that the message needs to be considered for re-translation (which is a good thing).
In the case that you're only correcting spelling or making some other change that definitely doesn't require translation, it's a simple matter to update the IDs for that string in the resource files.
That said, I'm currently evaluating whether or not to carry this way of doing I18N forward to a new project, so it's good to hear some thoughts on why it might not be a good idea.
I strongly disagree with Richard Harrison's answer, in which he states it is "the only way". Dear asker, do not trust an answer that states it is the only way, because the "only way" doesn't exist.
Here is another way which IMHO has a few advantages over Richard's approach:
Start with using the proto-version of the English string as Original.
Don't display these proto-strings but create a translation file for English nonetheless
Copy the proto-strings to the translation for the beginning
Advantages:
readable code
text in your code is very close if not identical to what your view displays
if you want to change the English text, you don't change the proto-string but the translation
if you want to translate the same thing twice, just write a slightly different proto-string or just add 'version for this and that' and you still have a perfectly readable code
I use meaningful IDs such as "welcome_back_1" which would be "welcome back, %1" etc. I always have English as my "base" language so in the worst case scenario when a specific language doesn't have a message ID, I fall-back on English.
I don't like to use actual English phrases as message IDs because if the English changes so does the ID. This might not affect you much if you use some automated tools, but it bothers me. I don't like to use simple codes (like msg3975) because they don't mean anything, so reading the code is more difficult unless you litter comments everywhere.
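A rough sketch of that fall-back-on-English lookup, in Python for illustration. The CATALOGS dictionary, the translate helper and the sample data are hypothetical; only the "welcome_back_1" ID and its "%1" placeholder come from the description above.

```python
CATALOGS = {
    "en": {"welcome_back_1": "welcome back, %1"},
    "no": {},  # Norwegian catalog with the message still untranslated
}

def translate(msg_id, language, *args, base="en"):
    """Look up msg_id in language, falling back to the base language,
    and finally to the raw ID if even the base language lacks it."""
    text = CATALOGS.get(language, {}).get(msg_id)
    if text is None:
        text = CATALOGS[base].get(msg_id, msg_id)
    # Replace positional placeholders %1, %2, ... with the arguments.
    # Highest position first, so that %10 is not mistaken for %1.
    for position in range(len(args), 0, -1):
        text = text.replace(f"%{position}", str(args[position - 1]))
    return text

print(translate("welcome_back_1", "no", "Kari"))
```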
The reason for the IDs being English is so that the ID is returned if the translation fails for whatever reason - the translation for the current language and token not being available, or other errors.
That of course assumes the developer is writing the original English text, not some documentation person.
Also if the English text changes then probably the other translations need to be updated?
In practice we also use pure IDs rather than the English text, but it does mean we have to do lots of extra work to default to English.
There is a lot to consider and the answer is not so easy.
Using plain English
Pros
Easy to write and READ code
In most cases, it works even without running translation functions in code
Cons
Involved programmers must be also good copywriters :)
You need to write correct, precise texts fully in English, even when the first language you need to run is something else (e.g. we're starting a lot of projects in Czech and localizing them to EN later).
In a lot of cases, you need to use contexts. If you fail to do it from the beginning, it's a lot of work to add them later. To explain: in English, one word can have many different meanings, and you need to use contexts to differentiate them, which is not always so easy ("order" can be a sort order, or it can be a purchase order).
It can be very hard to correct English later in the process. Corrections of the source strings will very often lead to loss of already translated phrases. It's very frustrating to lose translations into 3 different languages just because you corrected the English.
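To illustrate the contexts point from the list above: in a PO file a context is written as a msgctxt line before the msgid, and GNU gettext looks such messages up under the key "context", then the byte \x04, then the msgid. The Python sketch below imitates that with a plain dict; the catalog contents and the context names "sorting" and "purchase" are made up for the example.

```python
# GNU gettext stores a message with context under the key
# "<context>\x04<msgid>"; this sketch imitates that with a plain dict.
CONTEXT_SEPARATOR = "\x04"

# Hypothetical Czech catalog: the same English word, two meanings.
CATALOG_CS = {
    "sorting" + CONTEXT_SEPARATOR + "Order": "Pořadí",
    "purchase" + CONTEXT_SEPARATOR + "Order": "Objednávka",
}

def pgettext(context, msgid, catalog=CATALOG_CS):
    """Return the translation for msgid in the given context,
    or msgid itself when the catalog has no matching entry."""
    return catalog.get(context + CONTEXT_SEPARATOR + msgid, msgid)

print(pgettext("sorting", "Order"))
print(pgettext("purchase", "Order"))
print(pgettext("menu", "Order"))
```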
Using keys
Pros
You can use localization platform functions even for the English language. E.g. we're using the lovely Crowdin platform. There are a lot of handy tools, or rather a complete workflow, for translation management: voting for different translations, translation history, glossaries (which help to keep the translation/language coherent), proofing, approval, etc. Using keys makes this process much smoother.
It's much easier to send English texts for proofreading etc. Usually, it's not a good idea to let copywriters modify your code directly :)
Cons
More complicated project setup.
Harder to use %d, %s etc.
In a word: don't do this.
The same word/phrase in English can often enough have more than one meaning, and each meaning a different translation.
Define mnemonic ids for your strings, and treat English as just another language.
Agree with other posters that id numbers in code are a nightmare for code readability.
Ex localisation engineer
Haven't you already answered your own question? :)
Clearly, if you intend to support i18n of your application, you should treat all the language implementations the same. If someone decides a string needs to change, you make a similar change in all the language files. The metadata with the checkin should group all the language files together in the same change. If your "default" language is handled differently, that makes it more difficult to maintain.
At the end of the day, a translator should be able to sit down and change the texts for every language (so they match in meaning) without having to involve the programmer that already did his/her job.
This makes me feel like the proper answer is to use a modified version of gettext where you put strings like this
_(id, backup_text, context)
_('ABOUT_ME', 'About Me', 'HOMEPAGE')
context being optional
why like this?
because you need to identify text in the system using unique IDs, not English text that could get repeated elsewhere.
You should also keep the backup, id and context in the same place in your code to reduce discrepancies.
The IDs also have to be readable, which brings in the problem of synonyms and duplicate use (even as IDs). We could prefix the IDs like this: "HOMEPAGE_ABOUT_ME" or "MAIL_LETTER", but
people forget to do this at the start and changing it later is a problem
it's more flexible for the system to be able to group both by id and context
which is why I also added the context variable at the end
the backup text can be pretty much anything, could even be "[ABOUT_ME#HOMEPAGE text failed to load, please contact example@example.com]"
It won't work with the current gettext editing programs like "poedit", but I think you can define custom variable names for translations like just "t()" without the underscore at the start.
I know that gettext also has support for contexts, but it's not very well documented or widely used.
P.S. I'm not sure about the best variable order to enforce good and extendable code so suggestions are welcome.
I'd go so far as to say that you never (for most values of never) want to use free text as keys to anything. Imagine if SO used the query title as key to this page for instance. If someone links to it, and then the title is edited, the link is no longer valid.
Your problem is similar, except you would also be responsible for updating all links...
Like Douglas Leeder mentions, what you probably want to do is use English as the default (backup) language, although an interface that uses English and another language intermixed is highly confusing (but mildly amusing, too).
In addition to the considerations above, there are many cases where you'd want the "key" (msgid) to be different from the source text (English). For example, in the HTML view, I might want to say [yyyy] where the destination and label of that anchor tag depend on the locale of the user. E.g. it might be a link to a social network, and in US it would be Facebook but in China it would be Weibo. So the MsgIds might be something like socialSiteUrl and socialSiteLabel.
I use a mix.
For basic strings that I don't think will have conflicts/changes/weird meanings, I'll make the key be the same as the English.
We use Dutch. The strings should be written in the native language of the writer; this makes communication with translators less prone to errors, since the writer(s) can communicate in their native language with them.