How to escape/strip special characters in the LaTeX document?

How to escape/strip special characters in the LaTeX document? - php

We implemented the online service where it is possible to generate PDF with predefined
structure. The user can choose a LaTeX template and then compile it with an appropriate inputs.
The question we worry about is the security, that the malicious user was not able to gain shell access through the injection of special instruction into latex document.
We need some workaround for this or at least a list of special characters that we should strip from the input data.
Preferred language would be PHP, but any suggestions, constructions and links are very welcomed.
PS. in few word we're looking for mysql_real_escape_string for LaTeX

Here's some code to implement the Geoff Reedy answer. I place this code in the public domain.
<?
$test = "Test characters: # $ % & ~ _ ^ \ { }.";
header( "content-type:text/plain" );
print latexSpecialChars( $test );
exit;
function latexSpecialChars( $string )
{
$map = array(
"#"=>"\\#",
"$"=>"\\$",
"%"=>"\\%",
"&"=>"\\&",
"~"=>"\\~{}",
"_"=>"\\_",
"^"=>"\\^{}",
"\\"=>"\\textbackslash",
"{"=>"\\{",
"}"=>"\\}",
);
return preg_replace( "/([\^\%~\\\\#\$%&_\{\}])/e", "\$map['$1']", $string );
}

The only possibility (AFAIK) to perform harmful operations using LaTeX is to enable the possibility to call external commands using \write18. This only works if you run LaTeX with the --shell-escape or --enable-write18 argument (depending on your distribution).
So as long as you do not run it with one of these arguments you should be safe without the need to filter out any parts.
Besides that, one is still able to write other files using the \newwrite, \openout and \write commands. Having the user create and (over)write files might be unwanted? So you could filter out occurrences of these commands. But keeping blacklists of certain commands is prone to fail since someone with a bad intention can easily hide the actual command by obfusticating the input document.
Edit: Running the LaTeX command using a limited account (ie no writing to non latex/project related directories) in combination with disabling \write18 might be easier and more secure than keeping a blacklist of 'dangerous' commands.

According to http://www.tug.org/tutorials/latex2e/Special_Characters.html the special characters in latex are # $ % & ~ _ ^ \ { }. Most can be escaped with a simple backslash but _ ^ and \ need special treatment.
For caret use \^{} (or \textasciicircum), for tilde use \~{} (or \textasciitilde) and for backslash use \textbackslash
If you want the user input to appear as typewriter text, there is also the \verb command which can be used like \verb+asdf$$&\~^+, the + can be any character but can't be in the text.

In general, achieving security purely through escaping command sequences is hard to do without drastically reducing expressivity, since it there is no principled way to distinguish safe cs's from unsafe ones: Tex is just not a clean enough programming language to allow this. I'd say abandon this approach in favour of eliminating the existence of security holes.
Veger's summary of the security holes in Latex conforms with mine: i.e., the issues are shell escapes and file creation.overwriting, though he has missed a shell escape vulnerability. Some additional points follow, then some recommendations:
It is not enough to avoid actively invoking --shell-escape, since it can be implicitly enabled in texmf.cnf. You should explicitly pass --no-shell-escape to override texmf.cnf;
\write18 is a primitive of Etex, not Knuth's Tex. So you can avoid Latexes that implement it (which, unfortunately, is most of them);
If you are using Dvips, there is another risk: \special commands can create .dvi files that ask dvips to execute shell commands. So you should, if you use dvips, pass the -R2 command to forbid invoking of shell commands;
texmf.cnf allows you to specify where Tex can create files;
You might not be able to avoid disabling creation of fonts if you want your clients much freedom in which fonts they may create. Take a look at the notes on security for Kpathsea; the default behaviour seems reasonable to me, but you could have a per user font tree, to prevent one user stepping on another users toes.
Options:
Sandbox your client's Latex invocations, and allow them freedom to misbehave in the sandbox;
Trust in kpathsea's defaults, and forbid shell escapes in latex and any other executables used to build the PDF output;
Drastically reduce expressivity, forbidding your clients the ability to create font files or any new client-specified files. Run latex as a process that can only write to certain already existing files;
You can create a format file in which the \write18 cs, and the file creation css, are not bound, and only macros that invoke them safely, such as for font/toc/bbl creation, exist. This means you have to decide what functionality your clients have: they would not be able to freely choose which packages they import, but must make use of the choices you have imposed on them. Depending on what kind of 'templates' you have in mind, this could be a good option, allowing use of packages that use shell escapes, but you will need to audit the Tex/Latex code that goes into your format file.
Postscript
There's a TUGBoat article, Server side PDF generation based on LATEX templates, addressing another take on the question to the one I have taken, namely generating PDFs from form input using Latex.

You'd probably want to make sure that your \write18 is disabled.
See http://www.fceia.unr.edu.ar/lcc/cdrom/Instalaciones/LaTex/MiKTex/doc/ch04s08.html and http://www.texdev.net/2009/10/06/what-does-write18-mean/

Related

special symbols in filename doesn't display correct in mPDF

I have code:
$mpdf = new mPDF();
$mpdf->WriteHTML('some html text');
return $mpdf->Output("123!##$%^&*()_+<><?:}{P}" . '.pdf', 'I');
But when I save document in filename instead symbols <>?: displays -----.
Can it be fixed?

First of all, this question has nothing to do with PDF generation. You want to create a file system object with a name that includes characters that have a special meaning in some shells:
< is the input redirecton operator
> is the output redirection operator
? is the any character wildcard
: is the Windows drive letter separator
And to you want to accomplish it through an additional layer you don't have control over (I assume a web browser).
Some file systems (not all) treat object names as raw byte strings and do not impose any condition. I recall being able to create files in an old Unix box that contained a * character and a line feed, after I read a book that explained such thing was possible. However, a file name goes though several software layers, many of which actually need to understand the name, and some of them will possibly impose additional restrictions to those of the file system itself. So, even if you manage to create the file, you might not be able to read it back later.
For this reason, the browser actively removes problematic characters. In some cases, it might be overzealous (: is safe on Unix) but it just tries to prevent potential issues (e.g. the Unix file is emailed or copied to a Windows share) and there's nothing you can do on the server to avoid that.

Allow only hexadecimal html entities

Got a forum and posting HTML is forbidden.
However, some users would like to have the possibility to post some symbolic signs, hexadecimal html entities, such as:
💗
See: http://graphemica.com/%F0%9F%92%97 for more info.
My questions are:
Is this safe to allow them such symbols at all (XSS, etc..)?
What's the best function to use, to allow it? Actually the symbolic html entities appear as plain text.
I want to disallow members using & or » and so on, so just html-entities starting with &# and followed by a number plus the semicolon at the end.
Any idea how to solve this?

Another answer is to use jQueries .text method to add the message to your forum message element.
Although you will have to change how your forum creates the message structure.
You can safely add any sequence of characters and none of them will be interpreted by the browser as HTML.
Example:
$('#message_text').text(naughty_msg_string);

Is this safe to allow them such symbols at all (XSS, etc..)?
No, this is never safe. For example, & is just a convenient alias for &, which is still an ampsersand. Similarly < is a lesser-than sign, and thus 'naively' allowing numeric HTML-entities can still open up an XSS attack surface, if you forget this during processing.
You could consider only allowing numeric symbols outside the main ASCII table (128+), which would be more safe.
What's the best function to use, to allow it? Actually the symbolic html entities appear as plain text.
Considering the above function, preg_replace_callback is a good candidate, as it allows you to test the content before (dis)allowing it.
This also answers the third question, as you can just test for numbers in the regexp.

Do I need to sanitize input to file_exists?

I can't seem to find a reference. I am assuming the PHP function file_exists uses system calls on linux and that these are safe for any string that does not contain a \0 character, but I would like to be sure.
Does anyone have (preferably non-anecdotal) information regarding this? Is is vulnerable to injection if I don't check the strings first?

I guess you need to, because the user may enter something like :
../../../somewhere_else/some_file and access a file that he is not allowed to access .
I suggest that you generate the absolute path of the file independently in your php code and just get the file name from user by basename()
or exclude any input containing ../ like :
$escaped_input = str_replace("../","",$input);

It depends on what you're trying to protect against.
file_exists doesn't do any writing to disk, which means that the worst that can happen is that someone gains some information about your file system or the existence of files that you have.
In practice however, if you're doing something later on with the same file that was previously checked with file_exists, such as includeing it, you may wish to perform more stringent checks.
I'm assuming that you may be passing arbitrary values, possibly sourced from user input, into this function.
If that is the case, it somewhat depends on why you actually need to use file_exists in the first place. In general, for any filesystem function that the user can pass values directly into, I'd try to filter out the string as much as possible. This is really just being pedantic and on the safe side, and may be unnecessary in practice.
So, for example, if you only ever need to check the existence of a file in a single directory, you should probably strip out directory delimiters of all sorts.
From personal experience, I've only ever passed user input into a file_exists call for mapping to a controller file, in which case, I'd just strip out any non-alphanumeric + underscore character.
UPDATE: reading your comments recently added, no there aren't special characters as this isn't executed in a shell. Even \0 should be fine, at least on newer PHP versions (I believe older ones would cut the string before the \0 when sent to underlying filesystem calls).

coding style for HTML i18n/l10n with variables

I have been working on web developement for quite some time now and I have always struggled to find a clean solution for a problem I have encountered during i18n of HTML strings, mostly anchor tags.
First of let me show you a typical problematic example. This is a frequently encountered string in HTML templates:
Welcome to my site. Check out our cool products
you should not miss.
How do I translate this string while still having the following properties:
Dynamic generation of the URL (e.g. using a router)
A translatable string that is as readable as possible (so translators can do it w/o looking at the code)
Because the string contains HTML, I probably want to escape some parts I insert (e.g. the URL), so I don't make myself vulnerable to XSS if this URL contains user input
It should look as good as possible in the code as well
How do you translate your strings when they contain dynamic content and HTML?

When I now want to apply i18n to that string, I probably turn to gettext or a framework function. Since I come from the PHP/Joomla! world, I used JText::_ before, which acts very similar to gettext. In Python I now use Babel. Both share the same problem and probably more languages, too. All code I share here is my way of doing it in Python, more explicitly, in my Mako templates
Of course, the problem is: There is HTML in our string to be translated (and a URL, for that matter). Here are my options, which I will each explain afterwards:
Passing the raw string to gettext
Splitting the text into three bits
Surrounding linked word with variables
Using one variable that gets build seperately
Passing the raw string to gettext
This one seems the first approach one might take, if not aware of the implications.
Approach 1:
_('Welcome to my site. Check out our cool products \
you should not miss.')
For this msgid you could now translate it, keeping the HTML intact.
Advantages:
This looks very clean in the code and is easy to understand
If the translator is keeping the HTML intact this does not produce any problems
Disadvantages:
The translator has to know at least a little HTML
The string is completely unflexible, e.g. if the URL changes, all translations have to be adjusted
It does not allow for dynamic generation of the URL using something like a router
So as a conclusion, while I used this I quickly hit my limit. My next idea was:
Splitting the text into three bits
Approach 2:
_('Welcome to my site. Check out our cool ') + '<a href="/products">' +\
_('products') + '</a>' + _(' you should not miss.')
Advantages:
The URL is completely flexible now
Only actual text for the translators
Disadvantages:
Splits a sentence into three parts
Translator has to know which parts relate together or he might not be able to produce meaningful sentences
Not very pretty in code
The msgid may be a single word, which can cause problems (beware of contexts) but can be fixed.
I used this technique for some time because I did not know about printf style strings in PHP (which I used back then). Because this looked so ugly, I tried a different approach:
Surrounding linked word with variables
Approach 3:
_('Welcome to my site. Check out our cool %sproducts%s you should not miss.' % \
('', '')
Advantages:
Single string to translate, a complete sentence
Translator gets the context right from the string
Code is not that ugly
Disadvantages:
Translator has to take care that no %s goes missing (might be confusion as it reads like sproducts)
Introduces two format string variables for every URL, one being only </a>
Using one variable that gets build seperately
From here I had some different approaches, but I finally came of with the one I currently use (which might look like overkill, but I perfer it for now).
Approach 4:
_('Welcome to my site. Check out our cool %s \
you should not miss.') % ('%s' % ('/products', _('products')))
Let me take some time to reason this (seemingly lunatic) approach. First of all, the actual translation string looks like this:
_('Welcome to my site. Checkout our cool ${product_url} \
you should not miss.')
Which leaves a translator with the information what is inserted there (that's the translationstring version). Second, I want to ensure that I can manually escape all parts that are inserted into the HTML. While Mako provides automatic escaping, this does not make sense in a statement like this:
${'This is a url'}
It would destroy the url so I have to apply the |n filter to remove any escaping. However, if any argument of that is user supplied, it also opens up to XSS which I want to prevent. Not taking any risk, I can just escape any input (the same way good template engines do by defualt) and then remove Mako's escaping for this one string. So
'%s' % ('/products', _('products'))
actually looks like
'%s' % (escape('/products'), _('products'))
where escape is imported from markupsafe (see Markupsafe).
The final part now is dynamic URLs through a router: request.route_url('products_view')
To combine each of these possibilities, I have to produce something very ugly (note that this uses the mapping keyword argument of translationstring (translationstring.TranslationString) but that combines all the benefits I want/need from translation:
Final result:
_('Welcome to my site. Checkout our cool ${product_url} \
you should not miss.', mapping={'product_url': '%s' %\
(escape(request.route_url('products_view')), _('products'))})
Advantages:
Full HTML escpaing
Fully dynamic
Very good msgids for translation
Disadvantages:
An extremely ugly construct in the template (or the program anyway)
The lingua extractor doesn't catch _('products') so we have do extract that manually
So that is it, this concludes my approaches to this problem. Maybe I am doing something way to complicated and you have a lot better ideas or maybe this is a problem that depends on specific types of translatable text (and one has to choose the right approach).
Did I miss any solution or anything that would improve my approach?

Go back up a line in a linux console?

I know I can go back the line and overwrite its contents with \r.
Now how can I go up into the previous line to change that?
Or is there even a way to print to a specific cursor location in the console window?
My goal is to create some self-refreshing multiline console app with PHP.

Use ANSI escape codes to move the cursor. For example: Esc [ 1 F. To put the Escape character in a string you'll need to specify its value numerically, for example "\x1B[1F"
As sujoy suggests, you can use PHP ncurses for a more abstract way to move the cursor.
Whilst most "consoles" allow ANSI escape codes, other sorts of terminal use different character sequences, ncurses provides a standardised API that is terminal independent. Have a quick look at /etc/termcap (and then man terminfo) if you are interested.
Update: Lars Wirzenius' answer has a useful summary of the background. Some years ago I also wrote a short article on terminals.

The Linux virtual consoles emulate an old-time display terminal, although not perfectly. See Wikipedia on VT-100 for an example of the hardware.
These terminals read data from a serial port, and displayed it on the screen. They also looked for special bytes in the input stream from the serial port and acted upon them in other ways. For example, the newline character ('\n', byte value 10) would go to the beginning of the next line, and the carriage return character ('\r', byte value 13) would go the beginning of the current line.
More interestingly, an ASCII ESC byte (27) would start a command sequence which could to almost anything to the cursor or display. One such sequence might move the cursor to the top left of the screen, another to a given row and column. A third one might clear the screen, and a fourth one might make text be displayed in reverse colors.
Every manufacturer of terminals would invent their own command sequences (and they didn't always start with ESC either), and then change them depending on what they could make new versions of their hardware do. If a manufacturer added colors or simple graphics, those resulted in new sequences.
Adapating every application to every terminal and every change to the command sequences would have been a big task. Compare it with adapting every web application to a new browser version.
As usual, the solution is to add a layer of abstraction. In Unix, the initial abstraction was called termcap, and consisted of the file /etc/termcap, and a library to read the file. The file would specify the actual command sequences to send for each logical operation for each terminal model. So a vt102 terminal model would map the operation "clear the screen" to the \033[2J. This allowed application programmers to think in terms of the logical operations, which was much simpler.
Of course, not simple enough... The termcap library was not as good as it might have been, so two other libraries were developd: curses provided a higher abstraction level, including user input, and terminfo made the terminal definitions and their use by programmers easier.
In modern times, ncurses is a free re-implementation of curses and terminfo has pretty much replaced termcap completely. Also, ANSI has defined some "standard" sequences, based on the Digital terminals, and almost every terminal emulator uses those, at least mostly, and the Linux virtual console is one of them. Very few people have actual physical terminals anymore.
For what you're trying to do, ncurses or the tput command may be most useful. Or you may decide that just clearing the whole screen (see clear(1)) and writing output then is easiest.

My goal is to create some self-refreshing multiline console app with
PHP
For what you are trying to achieve ncurses is the way to go.

You shoud read about ncurses. In shell, you can go one line up by:
tput cuu1
See man terminfo for more options.
But executing shell command to move cursor around is quite desperate.

You just you the up and down arrows on the keyboard to scroll through console history but there is also the history command. Find out more using man history

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.