coding style for HTML i18n/l10n with variables

coding style for HTML i18n/l10n with variables - php

I have been working on web developement for quite some time now and I have always struggled to find a clean solution for a problem I have encountered during i18n of HTML strings, mostly anchor tags.
First of let me show you a typical problematic example. This is a frequently encountered string in HTML templates:
Welcome to my site. Check out our cool products
you should not miss.
How do I translate this string while still having the following properties:
Dynamic generation of the URL (e.g. using a router)
A translatable string that is as readable as possible (so translators can do it w/o looking at the code)
Because the string contains HTML, I probably want to escape some parts I insert (e.g. the URL), so I don't make myself vulnerable to XSS if this URL contains user input
It should look as good as possible in the code as well
How do you translate your strings when they contain dynamic content and HTML?

When I now want to apply i18n to that string, I probably turn to gettext or a framework function. Since I come from the PHP/Joomla! world, I used JText::_ before, which acts very similar to gettext. In Python I now use Babel. Both share the same problem and probably more languages, too. All code I share here is my way of doing it in Python, more explicitly, in my Mako templates
Of course, the problem is: There is HTML in our string to be translated (and a URL, for that matter). Here are my options, which I will each explain afterwards:
Passing the raw string to gettext
Splitting the text into three bits
Surrounding linked word with variables
Using one variable that gets build seperately
Passing the raw string to gettext
This one seems the first approach one might take, if not aware of the implications.
Approach 1:
_('Welcome to my site. Check out our cool products \
you should not miss.')
For this msgid you could now translate it, keeping the HTML intact.
Advantages:
This looks very clean in the code and is easy to understand
If the translator is keeping the HTML intact this does not produce any problems
Disadvantages:
The translator has to know at least a little HTML
The string is completely unflexible, e.g. if the URL changes, all translations have to be adjusted
It does not allow for dynamic generation of the URL using something like a router
So as a conclusion, while I used this I quickly hit my limit. My next idea was:
Splitting the text into three bits
Approach 2:
_('Welcome to my site. Check out our cool ') + '<a href="/products">' +\
_('products') + '</a>' + _(' you should not miss.')
Advantages:
The URL is completely flexible now
Only actual text for the translators
Disadvantages:
Splits a sentence into three parts
Translator has to know which parts relate together or he might not be able to produce meaningful sentences
Not very pretty in code
The msgid may be a single word, which can cause problems (beware of contexts) but can be fixed.
I used this technique for some time because I did not know about printf style strings in PHP (which I used back then). Because this looked so ugly, I tried a different approach:
Surrounding linked word with variables
Approach 3:
_('Welcome to my site. Check out our cool %sproducts%s you should not miss.' % \
('', '')
Advantages:
Single string to translate, a complete sentence
Translator gets the context right from the string
Code is not that ugly
Disadvantages:
Translator has to take care that no %s goes missing (might be confusion as it reads like sproducts)
Introduces two format string variables for every URL, one being only </a>
Using one variable that gets build seperately
From here I had some different approaches, but I finally came of with the one I currently use (which might look like overkill, but I perfer it for now).
Approach 4:
_('Welcome to my site. Check out our cool %s \
you should not miss.') % ('%s' % ('/products', _('products')))
Let me take some time to reason this (seemingly lunatic) approach. First of all, the actual translation string looks like this:
_('Welcome to my site. Checkout our cool ${product_url} \
you should not miss.')
Which leaves a translator with the information what is inserted there (that's the translationstring version). Second, I want to ensure that I can manually escape all parts that are inserted into the HTML. While Mako provides automatic escaping, this does not make sense in a statement like this:
${'This is a url'}
It would destroy the url so I have to apply the |n filter to remove any escaping. However, if any argument of that is user supplied, it also opens up to XSS which I want to prevent. Not taking any risk, I can just escape any input (the same way good template engines do by defualt) and then remove Mako's escaping for this one string. So
'%s' % ('/products', _('products'))
actually looks like
'%s' % (escape('/products'), _('products'))
where escape is imported from markupsafe (see Markupsafe).
The final part now is dynamic URLs through a router: request.route_url('products_view')
To combine each of these possibilities, I have to produce something very ugly (note that this uses the mapping keyword argument of translationstring (translationstring.TranslationString) but that combines all the benefits I want/need from translation:
Final result:
_('Welcome to my site. Checkout our cool ${product_url} \
you should not miss.', mapping={'product_url': '%s' %\
(escape(request.route_url('products_view')), _('products'))})
Advantages:
Full HTML escpaing
Fully dynamic
Very good msgids for translation
Disadvantages:
An extremely ugly construct in the template (or the program anyway)
The lingua extractor doesn't catch _('products') so we have do extract that manually
So that is it, this concludes my approaches to this problem. Maybe I am doing something way to complicated and you have a lot better ideas or maybe this is a problem that depends on specific types of translatable text (and one has to choose the right approach).
Did I miss any solution or anything that would improve my approach?

Related

translating php strings - include formatting characters or not?

I'm using po files to translate my application using the gettext function.
I have a lot of strings using formatting characters like spaces, colons, question marks, etc....
What's best practice here?
E.g.:
_('operating database: '). DB_NAME. _(' on ').DB_HOST;
_('Your name:');
or
_('operating database').': '. DB_NAME.' '._('on').' '.DB_HOST;
_('Your name').':';
Should I keep them in translation or is it better to let them hardcoded? What are the pros and cons?

Neither of your examples is good.
The best practice is to have one string per one self-contained displayed unit of text. If you're showing a message box, for example, then all of its content should be one translatable string, even if it has more than one sentence. A label: one string; a message: one string.
Never, unless you absolutely cannot avoid it, break a displayed piece of text into multiple strings concatenated in code, as the above examples do. Instead, use string formatting:
sprintf(_('operating database: %s on %s'), $DB_NAME, $DB_HOST);
The reason is that a) some translations may need to put the arguments in different order and b) it gives the translator some context to work with. For example, "on" alone can be translated quite differently in different sentences, even in different uses in your code, so letting the translator translate just that word would inevitably lead to poor, hard to understand, broken translations.
The GNU gettext manual has a chapter on this as well.

If you keep them in translation than all translations will duplicate them. This means all this spaces, colons and etc will be duplicated for each language. What for?
I'm standing for translating just the meaning parts of the strings (second variant).

Zend_Translate: not string, but ID based translation - or how to handle changes in original string

As far as I understood, Zend_Translate uses strings as keys for translation files. This means that if I change the original string (e.g. fix some typo), all translations for this string will be lost.
Is there a way to update those translations automatically? My idea is to mark those translations as "TODO" when the original string has changed.
To achieve this, I guess I have to use an ID based translation system instead of a string based translation system. Every string has a unique ID.
I know that a string based translation system has the advantage that equal strings do not have to be translated twice. This is a very rare use case in my application, so translation equal strings twice would be absolutely fine.
I thought of implementing this myself, but I don't know how to do it with good performance.
Any suggestions on this? Can Zend_Translate handle changes in the original string? Are there other translation systems that can handle this use case?

This can't be handled automatically. At least not easily. (it might be done with code analysis, code generation, etc.). We used "stubs" for translation - $this->translate('please-register') // Please register. Works fine, but adds more work for developers - they need to create translation files even for the native language. :)

how to check if a php file is obfuscated?

is there any way we can check if a php file has been obfuscated, using php? I was thinking regex maybe (for instance ioncube's encoded file contains a very long alphabet string, etc.

One idea is to check for whitespace. The first thing that an obfuscator will do is to remove extra whitespace. Another thing you can look for is the number of characters per line, as obfuscators will put all the code into few (one?) lines.

Often, obsfuscators initialize very large arrays to translate variables into less meaningful names (eg. see obsfucator article
One technique may be to search for these super-large arrays, close to the top of the class/file etc. You may be able to hook xdebug up to examine/look for these. The whole thing of course depends on the obsfuscation technique used. Check the source code, there may be patterns they've used that you can search on.

I think you can use token_get_all() to parse the file - then compute some statistics. For example check for number of function calls(in calse obfuscator uses some eval() string and nothing else) and calculate average function length - for obfuscators it will usually be about 3-5 chars, for normal PHP code it should be much bigger. You can also use dictionary lookup for function/variable names, check for comments etc. I think if you know all obfuscator formats that you want to detect - it will be easy.

Compiling an AST back to source code

I'm currently in the process of building a PHP Parser written in PHP, as no existing parser came up in my previous question. The parser itself works fairly well.
Now obviously a parser by itself does little good (apart from static analysis). I would like to apply transformations to the AST and then compile it back to source code. Applying the transformations isn't much of a problem, a normal Visitor pattern should do.
What my problem currently is, is how to compile the AST back to source. There are basically two possibilities I see:
Compile the code using some predefined scheme
Keep the formatting of the original code and apply 1. only on Nodes that were changed.
For now I would like to concentrate on 1. as 2. seems pretty hard to accomplish (but if you got tips concerning that, I would like to hear them).
But I'm not really sure which design pattern can be used to compile the code. The easiest way I see to implement this, is to add a ->compile method to all Nodes. The drawback I see here, is that it would be pretty hard to change the formatting of the generated output. One would need to change the Nodes itself in order to do that. Thus I'm looking for a different solution.
I have heard that the Visitor pattern can be used for this, too, but I can't really imagine how this is supposed to work. As I understand the visitor pattern you have some NodeTraverser that iterates recursively over all Nodes and calls a ->visit method of a Visitor. This sounds pretty promising for node manipulation, where the Visitor->visit method could simply change the Node it was passed, but I don't know how it can be used for compilation. An obvious idea would be to iterate the node tree from leaves to root and replace the visited nodes with source code. But this somehow doesn't seem a very clean solution?

The problem of converting an AST back into source code is generally called "prettyprinting". There are two subtle variations: regenerating the text matching the original as much as possible (I call this "fidelity printing"), and (nice) prettyprinting, which generates nicely formatted text. And how you print matters
depending on whether coders will be working on the regenerated code (they often want fidelity printing) or your only
intention is to compile it (at which point any legal prettyprinting is fine).
To do prettyprinting well requires usually more information than a classic parser collects, aggravated by the fact that most parser generators don't support this extra-information collection. I call parsers that collect enough information to do this well "re-engineering parsers". More details below.
The fundamental way prettyprinting is accomplished is by walking the AST ("Visitor pattern" as you put it), and generating text based on the AST node content. The basic trick is: call children nodes left-to-right (assuming that's the order of the original text) to generate the text they represent, interspersing additional text as appropriate for this AST node type. To prettyprint a block of statements you might have the following psuedocode:
PrettyPrintBlock:
Print("{"}; PrintNewline();
Call PrettyPrint(Node.children[1]); // prints out statements in block
Print("}"); PrintNewline();
return;
PrettyPrintStatements:
do i=1,number_of_children
Call PrettyPrint(Node.children[i]); Print(";"); PrintNewline(); // print one statement
endo
return;
Note that this spits out text on the fly as you visit the tree.
There's a number of details you need to manage:
For AST nodes representing literals, you have to regenerate the literal value. This is harder than it looks if you want the answer to be accurate. Printing floating point numbers without losing any precision is a lot harder than it looks (scientists hate it when you damage the value of Pi). For string literals, you have to regenerate the quotes and the string literal content; you have to be careful to regenerate escape sequences for characters that have to be escaped. PHP doubly-quoted string literals may be a bit more difficult, as they are not represented by single tokens in the AST. (Our PHP Front End (a reengineering parser/prettyprinter represents them essentially as an expression that concatenates the string fragments, enabling transformations to be applied inside the string "literal").
Spacing: some languages require whitespace in critical places. The tokens ABC17 42 better not be printed as ABC1742, but it is ok for the tokens ( ABC17 ) to be printed as (ABC17). One way to solve this problem is to put a space wherever it is legal, but people won't like the result: too much whitespace. Not an issue if you are only compiling the result.
Newlines: languages that allow arbitrary whitespace can technically be regenerated as a single line of text. People hate this, even if you are going to compile the result; sometimes you have to look at the generated code and this makes it impossible. So you need a way to introduce newlines for AST nodes representing major language elements (statements, blocks, methods, classes, etc.). This isn't usually hard; when visiting a node representing such a construct, print out the construct and append a newline.
You will discover, if you want users to accept your result, that you will have to preserve some properties of the source text that you wouldn't normally think to store
For literals, you may have to regenerate the radix of the literal; coders having entered a number as a hex literal are not happy when you regenerate the decimal equivalent even though it means exactly the same thing. Likewise strings have to have the "original" quotes; most languages allow either " or ' as string quote characters and people want what they used originally. For PHP, which quote you use matters, and determines which characters in the string literal has to be escaped.
Some languages allow upper or lower case keywords (or even abbreviations), and upper and lower case variable names meaning the same variable; again the original authors typically want their original casing back. PHP has funny characters in different type of identifiers (e.g., "$") but you'll discover that it isn't always there (see $ variables in literal strings). Often people want their original layout formatting; to do this you have to store at column-number information for concrete tokens, and have prettyprinting rules about when to use that column-number data to position prettyprinted text where in the same column when possible, and what to do if the so-far-prettyprinted line is filled past that column.
Comments: Most standard parsers (including the one you implemented using the Zend parser, I'm pretty sure) throw comments away completely. Again, people hate this, and will reject a prettyprinted answer in which comments are lost. This is the principal reason that some prettyprinters attempt to regenerate code by using the original text
(the other is to copy the original code layout for fidelity printing, if you didn't capture column-number information). IMHO, the right trick is to capture the comments in the AST, so that AST transformations can inspect/generate comments too, but everybody makes his own design choice.
All of this "extra" information is collected by a good reenginering parser. Conventional parsers usually don't collect any of it, which makes printing acceptable ASTs difficult.
A more principled approach distinguishes prettyprinting whose purpose is nice formatting, from fidelity printing whose purpose is to regenerate the text to match the original source to a maximal extent. It should be clear that at the level of the terminals, you pretty much want fidelity printing. Depending on your purpose, you can pretty print with nice formatting, or fidelity printing. A strategy we use is to default to fidelity printing when the AST hasn't been changed, and prettyprinting where it has (because often the change machinery doesn't have any information about column numbers or number radixes, etc.). The transformations stamp the AST nodes that are newly generated as "no fidelity data present".
An organized approach to prettyprinting nicely is to understand that virtually all text-based programming language are rendered nicely in terms of rectangular blocks of text. (Knuth's TeX document generator has this idea, too). If you have some set of text boxes representing pieces of the regenerated code (e.g., primitive boxes generated directly for the terminal tokens), you can then imagine operators for composing those boxes: Horizontal composition (stack one box to the right of another), Vertical (stack boxes on top of each other; this in effect replaces printing newlines), Indent (Horizontal composition with a box of blanks), etc. Then you can construct your prettyprinter by building and composing text boxes:
PrettyPrintBlock:
Box1=PrimitiveBox("{"); Box2=PrimitiveBox("}");
ChildBox=PrettyPrint(Node.children[1]); // gets box for statements in block
ResultBox=VerticalBox(Box1,Indent(3,ChildBox),Box2);
return ResultBox;
PrettyPrintStatements:
ResultBox=EmptyBox();
do i=1,number_of_children
ResultBox=VerticalBox(ResultBox,HorizontalBox(PrettyPrint(Node.children[i]); PrimitiveBox(";")
endo
return;
The real value in this is any node can compose the text boxes produced by its children in arbitrary order with arbitrary intervening text. You can rearrange huge blocks of text this way (imagine VBox'ing the methods of class in method-name order). No text is spit out as encountered; only when the root is reached, or some AST node where it is known that all the children boxes have been generated correctly.
Our DMS Software Reengineering Toolkit uses this approach to prettyprint all the languages it can parse (including PHP, Java, C#, etc.). Instead of attaching the box computations to AST nodes via visitors, we attach the box computations in a domain-specific text-box notation
H(...) for Horizontal boxes
V(....) for vertical boxes
I(...) for indented boxes)
directly to the grammar rules, allowing us to succinctly express the grammar (parser) and the prettyprinter ("anti-parser") in one place. The prettyprinter box rules are compiled automatically by DMS into a visitor. The prettyprinter machinery has to be smart enough to understand how comments play into this, and that's frankly a bit arcane but you only have to do it once. An DMS example:
block = '{' statements '}' ; -- grammar rule to recognize block of statements
<<PrettyPrinter>>: { V('{',I(statements),'}'); };
You can see a bigger example of how this is done for Wirth's Oberon programming language PrettyPrinter showing how grammar rules and prettyprinting rules are combined. The PHP Front End looks like this but its a lot bigger, obviously.
A more complex way to do prettyprinting is to build a syntax-directed translator (means,
walk the tree and build text or other data structures in tree-visted order) to produce text-boxes in a special text-box AST. The text-box AST is then prettyprinted by another tree walk, but the actions for it are basically trivial: print the text boxes.
See this technical paper: Pretty-printing for software reengineering
An additional point: you can of course go build all this machinery yourself. But the same reason that you choose to use a parser generator (its a lot of work to make one, and that work doesn't contribute to your goal in an interesting way) is the same reason you want to choose an off-the-shelf prettyprinter generator. There are lots of parser generators around. Not many prettyprinter generators. [DMS is one of the few that has both built in.]

How do I HTML Encode all the output in a web application?

I want to prevent XSS attacks in my web application. I found that HTML Encoding the output can really prevent XSS attacks. Now the problem is that how do I HTML encode every single output in my application? I there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.

One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.

You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars

For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.

A nice way I used to escape all user input is by writing a modifier for smarty wich escapes all variables passed to the template; except for the ones that have |unescape attached to it. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)

My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason so encode anything is that even properties which you might assume to be boolean or numeric could contain malicious code (For example, checkbox values, if they're done improperly could be coming back as strings. If you're not encoding them before sending the output to the user, then you've got a vulnerability).

You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)

If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problem accepting external password containing < etc..

The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.

OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).

Output encoding is by far the best defense. Validating input is great for many reasons, but not 100% defense. If a database becomes infected with XSS via attack (i.e. ASPROX), mistake, or maliciousness input validation does nothing. Output encoding will still work.

there was a good essay from Joel on software (making wrong code look wrong I think, I'm on my phone otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everything. But by using they prefixes that infer a useful meaning looking at your code you'll see real quick if something isn't right. And you're going to need different encode/decode functions anyways.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.