Gettext blog posts different languages

I've been looking for a universal solution for over a month and haven't come up with one.
I need my whole website to be international, not just the UI. My blog posts also need to be multilingual, but they are dynamic (created via a CMS).
There are two problems:
I can use GNU Gettext to localize the UI, but I can barely imagine using it for blog posts (except through some scary things, like parsing and editing .po files through PHP, then compiling them to .mo with unreliable scripts, then using some tricks to avoid the Gettext cache...). So I've come up with the idea of making a database-based localization for dynamic content.
But that means I need to use two different localization systems: GNU Gettext and a database-based one. That's ridiculous. So I need to make all localization database-based.
Am I right? Are there smarter solutions? I would really appreciate any advice.

Do it the other way round.
Extract the translatable strings from your database into a .pot file, for example 'database.pot'. When you extract the messages from your PHP sources, do it (more or less) like this:
xgettext *.php database.pot
You can always use .po or .pot files as input to xgettext.
Then use your regular translation workflow, and once you have the translated .po files, parse the translations out of the files and write them back into the database.
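The extraction step could look roughly like the sketch below. It is only an illustration: buildPot() is a hypothetical helper, the header entry is reduced to the bare minimum, and the strings are assumed to have already been fetched from the database (for example, post titles and bodies).

```php
<?php
// Hypothetical helper: turns strings pulled from the database into .pot text.
function buildPot(array $strings): string
{
    // Minimal header entry (assumption); real projects usually carry more header fields.
    $pot = "msgid \"\"\nmsgstr \"\"\n\"Content-Type: text/plain; charset=UTF-8\\n\"\n\n";
    foreach (array_unique($strings) as $text) {
        // Escape backslashes, double quotes, newlines and tabs for PO syntax.
        $escaped = addcslashes($text, "\\\"\n\t");
        $pot .= "msgid \"{$escaped}\"\nmsgstr \"\"\n\n";
    }
    return $pot;
}
```

Writing the result with file_put_contents('database.pot', buildPot($strings)) would give a file that can be passed to xgettext alongside the PHP sources.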

How can I translate several HTML files without using Google Translator's Kit

I need to translate several PHP files (HTML Code + PHP Tags) into another language.
Google Translator's Kit allows this, but clears the PHP Tags, erases class="" attributes (?!) and adds html, head tags & what not. Completely useless.
How can I (ideally in batch) translate these files using any kind of automated translation service?
Thanks.

bmargulies's suggestion is the most clear-cut way of doing it. However, it takes time.
If you're in a pinch, or want to cut corners, a relatively simple way to do it is to use regular expressions to filter your code out yourself. Match over multiple lines (the /s flag in preg lets . match newlines), store the match, and replace it with a hash. Any hash. Just make sure it doesn't map to anything in any language.
Do the same for HTML tags if they are proving to be annoying to Google.
Translate with Google.
Replace the hashes back. Voila! Job done! If you're feeling even more daring, replacing them with an i18n-suitable structure instead of the original code might prove even more worthwhile.
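As a rough sketch of the mask-and-restore idea (not a hardened tool): the function names and the placeholder format below are made up for illustration, and the pattern only covers PHP blocks that are explicitly closed. Whether a given translation service leaves such tokens untouched is an assumption you would need to check.

```php
<?php
// Hypothetical helper: replace each PHP block with a hash token and remember it.
function maskPhpTags(string $source, array &$map): string
{
    // The /s flag lets "." match newlines, so multi-line PHP blocks are caught too.
    $pattern = '/<\?(?:php|=).*?\?>/s';
    return preg_replace_callback($pattern, function (array $m) use (&$map): string {
        // Assumed token format: an md5 hash wrapped in X's, unlikely to be a real word.
        $token = 'X' . strtoupper(md5($m[0])) . 'X';
        $map[$token] = $m[0];
        return $token;
    }, $source);
}

// Hypothetical helper: put the original PHP blocks back after translation.
function unmaskPhpTags(string $translated, array $map): string
{
    return strtr($translated, $map);
}
```

The same pair of functions could be pointed at HTML tags by swapping the pattern.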

You need to internationalize the code. You need to move all the translatable strings out into a separate file, so that you can shove that through Google and then easily drop in the results.
Researching the topic of PHP I18N will prove rewarding.

Google Translator toolkit is for documents - not so much for source code. You can organize your program's strings as documents and translate them in Google Translator toolkit, and there are, in fact, software projects that do it, but it's contrived. It would be much better to use a different method, as the other people here say.
Put the translatable strings in separate files - you can use something like YAML or JSON, for example, or just organize your strings as PHP arrays (that's how it's done in MediaWiki, for example). Each message should have a key. Use one file per language, or one file with all the languages and the strings grouped by language. (By the way, use standard ISO 639 language codes - don't make up your own. If you pick them the way BCP 47 language tags do, two-letter ISO 639-1 where one exists and three-letter ISO 639-3 otherwise, you'll be able to reuse them in HTML lang attributes.)
After you have organized your strings like that, write functions that load the strings from these files by message key and language code, and use these functions to display the messages - never use hardcoded strings.
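A minimal sketch of such a lookup function, assuming the strings have already been loaded into one array per language. The msg() name, the array layout and the fallback behaviour are illustrative choices, not something prescribed above.

```php
<?php
// Hypothetical message store: one array per language, keyed by message key.
// In practice each array could live in its own PHP, JSON or YAML file.
$messages = [
    'en' => ['greeting' => 'Hello', 'farewell' => 'Goodbye'],
    'de' => ['greeting' => 'Hallo'],
];

// Look up a message by key and language code. Falls back to a default
// language, then to the key itself, so missing strings stay visible.
function msg(array $messages, string $key, string $lang, string $fallback = 'en'): string
{
    return $messages[$lang][$key] ?? $messages[$fallback][$key] ?? $key;
}
```

With the sample data, msg($messages, 'greeting', 'de') gives the German string, while msg($messages, 'farewell', 'de') falls back to the English one.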
Finally, put your files up for translation using software such as Pootle, Transifex, Zanata, or the MediaWiki Translate extension.
(Disclaimer: I am a developer of the MediaWiki Translate extension.)

Using something like Gettext (namely php-gettext) is IMHO the best approach. Another widely used option is to simply extract the strings to separate files (be it PHP or JSON) and translate these. However, I'd recommend Gettext, as you will be using a standard format with a wide range of available tools.

ResourceBundle Editors

I have decided to use ResourceBundles to provide translations for my php application. While the ResourceBundle (http://www.php.net/manual/en/class.resourcebundle.php) functionality is quite new, I prefer the way it works over gettext.
My question is: are there any good ResourceBundle editors? In particular, I am looking for one that will scan all my source files and generate a list of message IDs that require translation and/or updating.
Previously, I have used POEdit to generate translation files for gettext. It is able to scan my source files for functions such as _() and hand me a list of message ids to translate.
I have tried installing an eclipse plugin (http://sourceforge.net/projects/eclipse-rbe/) and while it has a nice GUI, it doesn't scan my source files to generate a list of message IDs to translate.
Could anyone recommend any resource bundle editors?

After spending most of the day trying out loads of different tools and whatnot, I have devised this workflow, and hopefully it will help others as well.
ResourceBundles are quite new to PHP and there isn't much information about them. Surprisingly, although resource bundles have been used by various languages such as Java for quite a while, I wasn't able to find any tools that can deal with RBs in a general manner.
SirDarius's suggestion to try RB Manager was a good start. It's straightforward to use, but there are some issues:
To scan your source code, you need to set up the scanner using an xml file, which might not be intuitive.
More importantly, the scanner will report a list of unused keys and new keys. These still need to be manually added into your resource bundles using the RB Manager main application.
Finally, the ICU files exported by RB Manager are broken. Whereas content should be exported as is, RB Manager somehow converts the characters into the decimal equivalents of their ASCII codes. This makes the files unusable.
I have tried various tools to convert between formats, namely the XLIFF format, but I find that the XLIFF files generated are often malformed, and the next piece of software refuses to process them.
For those who might run into this issue in the future, here is what I have done:
I am assuming that you have some class or function that wraps the MessageFormatter and ResourceBundle classes. In my application, I use something like $translate->_('text'); to perform translation. The trick is to use POEdit. POEdit will scan your source files for _(), build a list of keys, and remove old keys. Remember that with MessageFormatter you only use a message ID, for example system.warning.reason, instead of a full string such as "Your action was denied. This has been logged".
You should then use POEdit to write your translations. Dealing with plurals is a bit different: you should not set up any plural rules in POEdit. Plurals are handled inline in the translation string, which is really flexible. See http://userguide.icu-project.org/formatparse/messages for some examples.
Once the translation was complete, I wrote a small PHP script to convert the .mo files generated by POEdit into the equivalent ICU files. For parsing the .mo files, I used the gettext adapter in Zend_Translate. It also has a function to grab all message IDs and messages, which is extremely useful. You then convert that data into ICU format like so:
root {
// message ID {" Pattern "}
system.warning.reason { "Your action was denied. This has been logged" }
}
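The conversion script itself isn't shown above, so here is only a guess at its core: a hypothetical toIcuSource() function that takes message ID => pattern pairs (however they were read from the .mo file) and emits text in the layout shown. The backslash escaping of quotes is an assumption on my part.

```php
<?php
// Hypothetical converter: message ID => pattern pairs to ICU resource bundle source text.
function toIcuSource(array $messages, string $locale = 'root'): string
{
    $out = "{$locale} {\n";
    foreach ($messages as $id => $pattern) {
        // Assumption: backslash-escaping is enough for quotes inside a pattern.
        $escaped = addcslashes($pattern, "\\\"");
        $out .= "    {$id} { \"{$escaped}\" }\n";
    }
    return $out . "}\n";
}
```

The returned text would be saved to a file and used as the input for the compile step.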
Once this is done, you need to download the ICU package from http://site.icu-project.org/download. In the bin directory, you will find an executable called genrb, which compiles the resource bundle into a binary for PHP to use.
The command is genrb inputfile.txt -e UTF-8. Notice that the input encoding is specified as UTF-8; you can change this to whatever encoding your input files use.
That's it. I believe this is the simplest and most productive workflow for generating and translating resource bundles for PHP.
If anyone comes up with a better way, or perhaps even a full standalone program to take care of this, please post a comment!

You can try this: http://icu-project.org/download/rbmanager.html
It is the Resource Bundle Manager written for the ICU project, on which PHP's ResourceBundle class is based.

Localize existing PHP application

I have an existing, database driven PHP application with around 40 pages printing an unknown (but rather large) number of English strings. The strings are currently all hard coded. There is also a set of static documentation pages. I now need to add language support to this application.
As far as I can tell gettext seems to be the standard solution to this problem, but gettext feels very much like a "hack" to me. I am also not certain about the additional overhead (both in development and run-time) it will cause. Are there any other solutions or frameworks that could be better suited to my requirements? Any best practices or pitfalls I should be aware of when starting this project?

gettext is the "standard" for multi-lingual text storage, but in the end, it is just a storage engine. It does nothing to put the proper text into your page. You need to abstract the text from your View so the proper text can be inserted.
In the end, you need a way to place the language text in your document. This means "tagging" the text in a certain way so that it can be searched for and replaced with the desired language text. For the e-commerce site I manage (3 languages), I used a technique derived from Facebook's FBML tagging system.
You can wrap your text in "tags" <trans id="slt">something like this</trans>. Then use the DOM tools in PHP to extract the ID, look up the translated text based on the id, replace the content between the tags with the proper language text. You can still use gettext for your storage mechanism, or your database. Your browser will ignore the tags, so your page will still look fine during development.
This is just one tagging example. You can use any tagging mechanism and use grep instead to extract, search and replace.
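A minimal sketch of the tag replacement, using a regular expression rather than the DOM tools to keep it short. The translateTags() name and the plain lookup array are assumptions; the translations could just as well come from gettext or a database.

```php
<?php
// Hypothetical helper: swap the text inside <trans id="..."> tags for its translation.
function translateTags(string $html, array $translations): string
{
    $pattern = '/<trans\s+id="([^"]+)">(.*?)<\/trans>/s';
    return preg_replace_callback($pattern, function (array $m) use ($translations): string {
        // $m[1] is the id; $m[2] is the original text, kept when no translation exists.
        $text = $translations[$m[1]] ?? $m[2];
        return "<trans id=\"{$m[1]}\">{$text}</trans>";
    }, $html);
}
```

A regex like this does not handle nested tags or extra attributes, which is where the DOM-based version earns its keep.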
For static pages, you can pre-generate the translated versions and load the appropriate language version. This way there is no extra overhead for different languages.

Creating a bilingual site

I have been working on a site for the last 2 years.
Now my client wants to make it bilingual, with English and Chinese.
Any idea what I should do?

You can have separate constants files, each containing a bunch of constants, and then go through the pages replacing the actual sentences with the constants.
Have a cookie that selects the constants file for the language the user has chosen.
This sort of design is good, as it allows more languages to be added easily in the future.
It is, however, a pain in the ass to do.
You could go for some sort of automated translator, but it won't read naturally.
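The cookie-based selection of a constants file might be sketched like this. The cookie name 'lang', the lang/ directory and the WELCOME_TEXT constant are all invented for the example; the whitelist check is there so that a tampered cookie cannot point the require at an arbitrary file.

```php
<?php
// Hypothetical helper: pick the language from a cookie, allowing only known codes.
function pickLanguage(array $cookies, array $supported, string $default): string
{
    $lang = $cookies['lang'] ?? $default;
    return in_array($lang, $supported, true) ? $lang : $default;
}

// Assumed layout: lang/en.php and lang/zh.php each define() the same constants.
// $lang = pickLanguage($_COOKIE, ['en', 'zh'], 'en');
// require __DIR__ . "/lang/{$lang}.php";
// echo WELCOME_TEXT;
```

Adding a third language then means adding one more file and one more entry in the supported list.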

I had to do this, and since you are working with PHP, I recommend gettext:
http://php.net/manual/en/book.gettext.php
And here's an article to get you started on it
http://mel.melaxis.com/devblog/2005/08/06/localizing-php-web-sites-using-gettext/
The benefit of gettext is that the translator never has to touch the code. Rather, the translator works by translating .po files with Poedit ( http://www.poedit.net/ ). You place the resulting files (gettext reads the compiled .mo files, which Poedit can generate from the .po files) in your locale directories, where php-gettext picks them up and uses them to replace the strings. Have fun and good luck, I got it to work pretty well.

Decent introduction article on the subject:
http://onlamp.com/pub/a/php/2002/11/28/php_i18n.html
We could really do with more info.

Most efficient approach for multilingual PHP website

I am working on a large multilingual website and I am considering different approaches for making it multilingual. The possible alternatives I can think of are:
The Gettext functions with generation of .po files
One MySQL table with the translations and a unique string ID for each text
PHP-files with arrays containing the different translations with unique string IDs
As far as I have understood the Gettext functions should be most efficient, but my requirement is that it should be possible to change a text string in the original reference language (English) without the other translations of that string automatically reverting back to English just because a couple of words changed. Is this possible with Gettext?
What is the least resource demanding solution?
Is using the Gettext functions or PHP files with arrays more or less equally resource demanding?
Any other suggestions for more efficient solutions?

A few considerations:
1. Translations
Who will be doing the translations? People who are also connected to the site? A translation agency? When using Gettext you'll be working with .pot (template) and .po (translation) files. These files contain the message ID and the message string (the translation). Example:
msgid "A string to be translated would go here"
msgstr ""
Now, this looks just fine and understandable for anyone who needs to translate this. But what happens when you use keywords, like Mike suggests, instead of full sentences? If someone needs to translate a msgid called "address_home", he or she has no clue whether this should be a header "Home address" or a full sentence. In this case, make sure to add comments to the file right before you call the gettext function, like so:
/// This is a comment that will be included in the pot file for the translators
gettext("ready_for_lost_episode");
Using xgettext --add-comments=/// when creating the .po files will add these comments. However, I don't think Gettext is meant to be used this way. Also, if you need to add a comment for every text you want to display, a) you'll probably make an error at some point, b) your whole script will be filled with the texts anyway, only in comment form, and c) the comments need to be placed directly above the Gettext function, which isn't always convenient, depending on the position of the function in your code.
2. Maintenance
Once your site grows (even further) and your language files along with it, it might get pretty hard to maintain all the different translations this way. Every time you add a text, you need to create new files, send the files out to translators, receive the files back, make sure the structure is still intact (eager translators are always happy to translate the syntax as well, making the whole file unusable :)), and finish by importing the new translations. It's doable, sure, but be aware of possible problems on this end with large sites and many different languages.
Another option: combine your 2nd and 3rd alternative:
Personally, I find it more useful to manage the translations using a (simple) CMS, keeping the variables and translations in a database and exporting the relevant texts to language files yourself:
1. add variables to the database (e.g.: id, page, variable);
2. add translations to these variables (e.g.: id, varId, language, translation);
3. select relevant variables and translations, write them to a file;
4. include the relevant language file in your site;
5. create your own function to display a variable's text:
text('var'); or maybe something like __('faq','register','lost_password_text');
Point 3 can be as simple as selecting all the relevant variables and translations from the database, putting them in an array and writing the serialized array to a file.
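To make the export and lookup steps concrete, here is a minimal sketch. The row layout, the function signatures and the file path handling are assumptions for the example; in a real setup $rows would come from a query joining the variables and translations tables.

```php
<?php
// Step 3 (sketch): write the translations for one language to a file.
// $rows is assumed to look like the result of joining variables and translations.
function exportLanguageFile(array $rows, string $language, string $path): void
{
    $texts = [];
    foreach ($rows as $row) {
        if ($row['language'] === $language) {
            $texts[$row['variable']] = $row['translation'];
        }
    }
    file_put_contents($path, serialize($texts));
}

// Steps 4 and 5 (sketch): load the language file once, then look up a variable.
function text(string $variable, string $path): string
{
    static $cache = [];
    if (!isset($cache[$path])) {
        // allowed_classes => false: the file should only ever contain plain arrays.
        $cache[$path] = unserialize(file_get_contents($path), ['allowed_classes' => false]);
    }
    return $cache[$path][$variable] ?? $variable;
}
```

Returning the variable name when no translation exists makes untranslated texts easy to spot on the page.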
Advantages:
Maintenance. Maintaining the texts can be a lot easier for big projects. You can group variables by page, sections or other parts within your site, by simply adding a column to your database that defines to which part of the site this variable belongs. That way you can quickly pull up a list of all the variables used in e.g. the FAQ page.
Translating. You can display the variable with all the translations of all the different languages on a single page. This might be useful for people who can translate texts into multiple languages at the same time. And it might be useful to see other translations to get a feel for the context so that the translation is as good as possible. You can also query the database to find out what has been translated and what hasn't. Maybe add timestamps to keep track of possible outdated translations.
Access. This depends on who will be translating. You can wrap the CMS with a simple login to grant access to people from a translation agency if need be, and only allow them to change certain languages or even certain parts of the site. If this isn't an option you can still output the data to a file that can be manually translated and import it later (although this might come with the same problems as mentioned before.). You can add one of the translations that's already there (English or another main language) as context for the translator.
All in all I think you'll find that you'll have a lot more control over the translations this way, especially in the long run. I can't tell you anything about the speed or efficiency of this approach compared to the native gettext function. But, depending on the size of the language files, I don't think there will be a big difference. If you group the variables by page or section, you can always include only the required parts.

After some testing I finally decided to go more or less along the lines of Alec's combination of the second and third alternatives.
Gettext problem
I tried to set up the whole gettext system first to try it out, but it turned out to be much more complicated than I thought. The problem is that Windows and Unix systems use different language short names for setlocale(). For the moment I'm running my dev server on Windows with Wamp, while the final site will run on Linux. Even after going through a couple of dozen guides, forums, questions etc., and restarting the server after each modification, I couldn't get it set up properly in any easy way. Additionally, gettext is not thread-safe; to update a language file the server needs to be restarted or a hack needs to be used; and there is no easy way of handling different versions of language files, or of handling the original English text without modifying the source or using Mike's suggestion, which, as Alec pointed out, isn't optimal.
Solution
So I ended up with what I think is the best solution, based on Alec's response:
Save all the translations in a DB with the fields language, page, var_key, version, revision and last_modified_time - where version corresponds to versions of the original translation (English), while revision allows the translator to modify/correct the finalized translations within a version.
Use a kind of CMS for translation, which is connected to the DB, handles different versions, and gives an easy overview of which languages are translated, in which version, and how complete the translations are.
When a revision of a version is finalized, cache files are generated - each file contains an array with only the var_key and text translation for one language and one page, and is named with the ISO 639-1 name of the language and the page name, like: lang/en_index.php. These language files are then simply included and wrapped in a function t($var_key), which allows the DB to be used during development and can later be changed to use only the cache files.
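Roughly, the cache file generation and the t() wrapper could look like the sketch below. This is not the original code: writeCacheFile() is a hypothetical helper, and t() takes the cache file as a second argument here only to keep the example self-contained, whereas the description above uses t($var_key) alone.

```php
<?php
// Hypothetical helper: write one cache file per language and page, e.g. lang/en_index.php.
function writeCacheFile(string $dir, string $lang, string $page, array $texts): string
{
    $file = "{$dir}/{$lang}_{$page}.php";
    // var_export() emits valid PHP, so the cache can simply be included later.
    file_put_contents($file, '<?php return ' . var_export($texts, true) . ';');
    return $file;
}

// Look up a var_key in an included cache file; each file is loaded only once.
function t(string $varKey, string $cacheFile): string
{
    static $loaded = [];
    if (!isset($loaded[$cacheFile])) {
        $loaded[$cacheFile] = include $cacheFile;
    }
    return $loaded[$cacheFile][$varKey] ?? $varKey;
}
```

During development, the body of t() could query the DB instead, with the calling code staying the same.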
Performance
I never got around to testing gettext, but according to the link Mike posted, the difference in performance between using an array and gettext is totally acceptable to me given the benefits a custom system provides, as described above. However, I did compare using an array with 20 translated text strings to retrieving the same 20 text strings from a MySQL DB. It turned out that using an array included from a file was around 6 times faster than retrieving all 20 strings at once from the MySQL DB. It was not a really scientific benchmark and the results will surely vary on different systems and setups, but it clearly shows what I expected - that using a DB would be much slower than using an array directly, which is why I chose to generate cache files for the array instead of using the DB.
As a comparison I also tested how fast it was to output simple echos with the same text. It turned out to be around 20 times faster than using arrays from an included file, but then it is not possible to translate without having different versions of the page for different languages, which defeats the purpose of dynamic pages. In that case it is better to also use a good cache system.
Performance test source files:
PHP: http://pastie.org/964082
MySQL table: http://pastie.org/964115
It is surely not perfect, but at least creates an idea about the performance differences.

Rather than using the English text as the keys, you could use arbitrary keys and also provide English translations, i.e.
the gettext key is 'hello'
You then have your various language translations of this key, plus an English translation that is also 'hello'. If you then want to update the English version of the string, you can leave the key alone and just update the English translation.
