Twig, output escaping and syntax highlighting JS plugins

I'm struggling with the following issue: I want to include some code examples on my page. They are mostly PHP, but also HTML and JS.
My best option is to use one of the JS-based syntax highlighters. I chose SyntaxHighlighter, because so many people recommend it on SO and other sites.
But what about output escaping in Twig? Of course, default escaping causes the code to show up beautifully escaped, but then the highlighting doesn't work properly. Using |raw to make it work results in another, obvious, issue: it breaks the page if HTML is presented, or breaks scripts if JavaScript is presented.
Another issue is that all the outputs that I need escaped are mixed, e.g.:
(some text, with html formatting)
<code class="someclass">
(block of code)
</code>
(some another text)
<code class="anotherclass">
(another block of code)
</code>
Given all these facts, I thought: let's write our own filter for Twig! That sounds great, but even though I was able to make it run, I couldn't make it work the way I want.
Then I thought: why should I reinvent the wheel? Twig and Symfony2 have been out for years; probably someone else has already solved this problem, and done it in a good, secure way.
I'm looking for one of four things:
Custom twig filter to handle this issue, or
Better syntax coloring script, that will handle output escaped by Twig, or
Some other solution, or
Any useful hint.
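
One way to approach the custom-filter idea is to escape only what sits inside the <code> elements and leave the surrounding markup alone. The following is a minimal sketch in plain PHP, not a tested Twig extension: the function name escape_code_blocks is hypothetical, it assumes the text around the code blocks is trusted author content, and it assumes no code sample contains a literal </code>. Registering it as a Twig filter marked safe for HTML would be a separate step.

```php
<?php
// Hypothetical helper: escape the contents of every <code>...</code> element
// and leave the surrounding (trusted) markup untouched.
function escape_code_blocks(string $html): string
{
    return preg_replace_callback(
        '~(<code\b[^>]*>)(.*?)(</code>)~is',
        static function (array $m): string {
            // $m[1] = opening tag, $m[2] = raw code, $m[3] = closing tag
            return $m[1]
                . htmlspecialchars($m[2], ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')
                . $m[3];
        },
        $html
    ) ?? $html; // fall back to the input if the regex engine reports an error
}
```

The result can then be printed without further escaping, and the highlighter sees ordinary escaped text inside each <code> element.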

Have a look at this manual:
http://isometriks.com/geshi-symfony2-and-twig-extensions
It worked for me.

Related

Is "HTML Purifier" really trustworthy? Is there a better way to secure an untrusted/unsafe HTML code string in PHP?

I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
1. On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git But that page was last updated in "2018-02-23", with the label "Whoops, forgot to edit WHATSNEW".
2. The same page as in #1 calls the Github link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
3. At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?

HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
(Side note: since this is more of a question of best practices, you might get better answers on the Software Engineering Stack Exchange site rather than Stack Overflow.)
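
To make the whitelist idea concrete for the "javascript:" href case from the question, here is a small sketch. It is not HTML Purifier's code; the function name is_whitelisted_href and the default scheme list are assumptions made for illustration, and it expects the attribute value as a DOM parser returns it (entities already decoded). Instead of looking for known-bad prefixes, it accepts only schemes that are explicitly allowed.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: accept a URL only when its
// scheme is on an explicit whitelist (or when it has no scheme at all).
function is_whitelisted_href(string $url, array $allowed = ['http', 'https', 'mailto']): bool
{
    // Remove ASCII control characters and whitespace first, so that tricks
    // such as a tab inside "javascript:" cannot hide the scheme.
    $compact = preg_replace('/[\x00-\x20\x7F]+/', '', $url) ?? '';

    // No scheme found: treat the value as a relative URL.
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $compact, $m)) {
        return true;
    }

    return in_array(strtolower($m[1]), $allowed, true);
}
```

A blacklist has to anticipate every obfuscation; a whitelist only has to recognise the few forms it allows, which is why unknown schemes are rejected by default here. A real sanitizer covers far more than this one attribute.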

Is it possible to prevent standard HTML comments from showing up in source code?

I'm going to assume the answer is 'no' here, but since I haven't actually found that answer, I'm asking.
Quite basically, all I want to do is leave some HTML commenting in my files for 'author eyes only', simply to make editing the file later a much more pleasant experience.
<!-- Doing it like this --> leaves nice clean comments but they show up when viewing the page source after output.
I am using PHP, so technically I could <?PHP /* wrap comments in PHP tags */ ?> which would prevent them from being output at all, but if possible I'd like to avoid all of the extra arbitrary PHP tagging that would be needed for commenting throughout the file. After all, the point of commenting is to make the document feel less cluttered and more organized.
Is there anything else I could try or are these my best options?

No, anything in HTML will show up.
You could have a script that parses the code and removes the comments before it is put up on the server; you would then have both the original and the uncommented source.
A tool to accomplish this:
http://code.google.com/p/htmlcompressor/

I guess these are your best options, yes, unless you run the entire HTML output through some sort of cleanup module before being sent to the client.
Anything not wrapped in server-side syntax will be output to the client if not modified on its way out (through template engines, for example). This goes for most (probably all) server-side languages.

You could definitely write a parser that uses regex to strip out HTML comments, but unless you're already dealing with a roll-your-own CMS, most likely the work involved in this far outweighs the benefits of not using PHP comments as you suggested.
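
For reference, the regex route could look roughly like the sketch below. The function name strip_html_comments is hypothetical, and the pattern deliberately leaves IE conditional comments in place. Being regex-based, it will also remove anything that merely looks like a comment inside <script>, <pre> or <textarea>, so treat it as an illustration rather than a robust parser.

```php
<?php
// Hypothetical helper: remove ordinary <!-- ... --> comments but keep
// conditional comments such as <!--[if lt IE 9]> ... <![endif]-->.
function strip_html_comments(string $html): string
{
    return preg_replace(
        '/<!--(?!\s*\[if\b)(?!\s*<!\[endif\]).*?-->/s',
        '',
        $html
    ) ?? $html; // fall back to the input if the regex engine reports an error
}

// Possible use: filter the whole page through an output buffer callback.
// ob_start('strip_html_comments');
```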

Escape from XSS vulnerability maintaining Markdown syntax?

I'm planning to use Markdown syntax in my web page. I will keep users' input (raw, no escaping or whatever) in the database and then, as usual, print it out and escape it on the fly with htmlspecialchars().
This is how it could look:
echo markdown(htmlspecialchars($content));
By doing that I'm protected from XSS vulnerabilities and Markdown works. Or, at least, it kind of works.
The problem is, let's say, the > syntax (there are other cases too, I think).
In short, to quote you do something like this:
> This is my quote.
After escaping and parsing to Markdown I get this:
&gt; This is my quote.
Naturally, the Markdown parser does not recognize &gt; as the quote symbol, so it does not work! :(
I came here to ask for solutions to this problem. One idea was to:
First, parse to Markdown, — then with HTML Purifier remove “bad parts”.
What do you think about it? Would it actually work?
I'm sure that someone has been in the same situation and can help me too. :)

Yes, a certain website has that exact same situation. At the time I'm writing this, you have 1664 reputation on that website :)
On Stack Overflow, we do exactly what you describe (except that we don't render on the fly). The user-entered Markdown source is converted to plain HTML, and the result is then sanitized using a whitelist approach (JavaScript version, C# version part 1, part 2).
That's the same approach that HTML Purifier takes (having never used it, I can't speak for details though).

The approach you are using is not secure. Consider, for instance, this example: "[clickme](javascript:alert%28%22xss%22%29)". In general, don't escape the input to the Markdown processor. Instead, use Markdown properly in a safe mode, or apply HTML Purifier or another HTML sanitizer to the output of the Markdown processor.
I've written elsewhere about how to use Markdown securely. See the link for details about how to use it safely, but the short version is: it is important to use the latest version, to set safe_mode, and to set enable_attributes=False.
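
A short sketch of both points, in PHP since that is what the question uses. The first part shows that the example payload passes through htmlspecialchars() unchanged. The second part only fixes the order of operations: render_untrusted_markdown is a hypothetical wrapper, and the Markdown converter and the sanitizer are passed in as callables because their concrete APIs are not assumed here.

```php
<?php
// The payload from the answer contains none of the characters that
// htmlspecialchars() rewrites (& < > " '), so escaping leaves it untouched.
$payload = '[clickme](javascript:alert%28%22xss%22%29)';
$escaped = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');
var_dump($escaped === $payload); // bool(true)

// Hypothetical wrapper: convert the raw Markdown first, then sanitize the
// resulting HTML. $toHtml and $sanitize stand in for a Markdown library and
// an HTML sanitizer such as HTML Purifier.
function render_untrusted_markdown(string $source, callable $toHtml, callable $sanitize): string
{
    return $sanitize($toHtml($source));
}
```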

What is the recommended way to embed HTML code inside PHP code?

Let's say I have 2 cases:
<?php
print("html code goes here");
?>
vs.
<?php
?>
html codes goes here
<?php
?>
Would the performance of the PHP interpreter be worse in the first case than in the second? (Due to the extra overhead of processing inside the print function.)
So, does anyone have a recommended way to insert HTML code inside PHP code?

Oh, for the sake of all those who edit your code later, please never put any significant amount of HTML code inside a string and echo it out. In any language.
HTML belongs in HTML files, formatted and styled by your IDE or editor, so it can be read. Giant string blocks are the biggest cause of HTML errors I have ever seen.
Performance shouldn't matter too much, in this case, but I would assume the second would be faster, because it is streamed directly to the output or buffer.
If you want it to be easier to read, enable short tags, and write it like this:
?><b>blah blah blah</b><?
Plus, with short tags enabled, it's easier to echo out variables (since PHP 5.4 the <?= form works even when short tags are disabled):
Hello, <?= $username ?>
If you are using this to generate some sort of reusable library, there are other options.

You should put HTML outside of PHP code in order for better maintenance and scalability. It's also very beneficial to do all your necessary data processing before displaying any data, in order to separate logic and presentation.
Rather than trying to constantly separate your PHP and HTML, you should instead be in the mindset of separating your backend logic and display logic.
The MVC (model-view-controller) pattern is a good way of thinking about your code; in my view, using PHP correctly means following it.
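
As a rough illustration of doing the data processing before any output, the sketch below splits one task into a logic function and a presentation function. Both function names and the shape of the $users array are made up for this example; it is not a full MVC setup.

```php
<?php
// Logic first: prepare plain data, with no markup involved.
// (Both function names are hypothetical.)
function active_usernames(array $users): array
{
    $names = [];
    foreach ($users as $user) {
        if (!empty($user['active'])) {
            $names[] = $user['name'];
        }
    }
    sort($names);
    return $names;
}

// Presentation second: mostly HTML, with PHP used only to loop and to
// escape each value on output.
function render_user_list(array $names): string
{
    ob_start();
    ?>
<ul>
<?php foreach ($names as $name): ?>
  <li><?= htmlspecialchars($name, ENT_QUOTES, 'UTF-8') ?></li>
<?php endforeach; ?>
</ul>
<?php
    return ob_get_clean();
}
```

The view part stays readable as HTML, and a designer can edit it without touching the function that decides which users are listed.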

Never put HTML inside PHP code unless you specifically intend to do so or it's very small. But then again, 100% separation is what I recommend. People will have to work very hard to understand your code later if you mix them up, especially designers who may not be comfortable with PHP.
The golden rule is that separating the front-end and back-end processes as much as possible helps in every aspect. Keep things where they belong: styles in CSS, JavaScript in JS files, PHP in a library folder/files, and just use the required classes/functions.
Use short tags <? if required (but I don't like them :P ), and the <?= tag for echoing output. That said, short tags are better avoided.

Don't do it that way at all! Use a templating system like Smarty so you can separate your logic from your display code. It also allows you to bring in a designer that can work with html and might not know php at all. Smarty templates read more like HTML than PHP and that can make a big difference when dealing with a designer.
Also, you don't have to worry about your designer messing with your code while doing a find/replace!
Better yet would be to go to a setup like Django or Rails that has a clearly delineated code/template setup, but if you can't do that (cheap hosting solutions) then at least use templating. I don't know if Smarty is the best thing for PHP, but it's far better than both solutions you are proposing.

[head above the parapet] Many of us have learnt templating from WordPress, where without embedding PHP it's virtually impossible to do anything. I can quite understand why people advocate strict MVC or engines such as Smarty, but the fact is that in WordPress development you need to manipulate output on the fly with PHP. In fact, coming from that background, I always used to assume that the 'hp' in PHP was for exactly that reason. So I could write 'normal' looking HTML, do a bit of server-side processing and then return to HTML.
So, from my point of view, the answer to your question is that the second of your examples is much easier to read, which is one of the fundamentals of elegant coding. But it does depend. If there's a lot of processing to produce a simple piece of HTML then it may be easier to build a large variable and echo it at the end. I abhor multiple lines of echo statements. In this case I am likely to use a function to keep my HTML clean. Again, WordPress does this a lot; for instance the_title() outputs a simple string but does a good deal of processing before outputting it, so <h1><? the_title(); ?> </h1> reads well.
That is the POV of a WordPress developer who was never formally taught complex coding. I expect to lose a fair amount of reputation points over this answer. [/head above the parapet]

PHP to clean-up pasted Microsoft input

I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.
I can't just get rid of <span whatever> as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).
Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.
Also, fixing of curly quotes, em-dashes, etc would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.

HTML Purifier will create standards compliant markup and filter out many possible attacks (such as XSS).
For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy which is a binding for the Tidy HTML utility.
If those don't help you, I suggest you switch to FCKEditor which has this feature built-in.

In my case, this worked just fine:
$text = strip_tags($text, '<p><a><em><span>');
Rather than trying to pull out stuff you don't want, such as embedded Word XML, you can just specify your allowed tags.
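
One thing worth knowing before relying on this: strip_tags() removes tags that are not on the list, but it keeps every attribute of the tags that are. The small sketch below (the helper name keep_basic_tags is hypothetical) shows that an inline style survives, so Word's inline formatting on allowed tags is not cleaned up, and event-handler attributes would pass through in the same way.

```php
<?php
// Hypothetical helper; the allowed-tag list is the one from the answer above.
function keep_basic_tags(string $html): string
{
    return strip_tags($html, '<p><a><em><span>');
}

// <b> is not on the list and is removed; the style attribute on <p> stays.
$pasted = '<p style="color:red">Hi <b>there</b></p>';
echo keep_basic_tags($pasted), "\n"; // <p style="color:red">Hi there</p>
```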

The website http://word2cleanhtml.com/ does a good job of converting from Word. I'm using it from PHP by scraping, to process some legacy HTML, and so far it's working pretty well (the result is very clean <p>, <b> code). Of course, being an external service, it's not a good fit for online processing like in your case.
If you try it and it brings many 400 errors, try filtering the HTML with Tidy first.

In my case, there was a pattern. The unwanted part always started with
<!-- [if gte mso 9]>
and ended by an
<![endif]-->
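So my solution was to keep only what comes before and after this block:
// Keep the text before the first "<!--" and after the last "[endif]-->".
$marker = '[endif]-->';
$start = strpos($string, '<!--');
$stop = strrpos($string, $marker);
if ($start !== false && $stop !== false && $stop > $start) {
    $string = substr($string, 0, $start) . substr($string, $stop + strlen($marker));
}
echo $string;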
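
If a pasted document contains several such blocks with wanted text between them, cutting from the first marker to the last one would also remove that text. A possible variant, sketched here under the assumption that every block follows the pattern shown above, removes each conditional block on its own with a regular expression. The function name strip_mso_conditionals is hypothetical.

```php
<?php
// Hypothetical helper: remove every "<!--[if ...]> ... <![endif]-->" block
// individually, leaving the text between blocks untouched.
function strip_mso_conditionals(string $html): string
{
    return preg_replace(
        '/<!--\s*\[if [^\]]*\]>.*?<!\[endif\]\s*-->/s',
        '',
        $html
    ) ?? $html; // fall back to the input if the regex engine reports an error
}

$pasted = 'A<!-- [if gte mso 9]><xml>junk</xml><![endif]-->B'
        . '<!--[if gte mso 10]>more<![endif]-->C';
echo strip_mso_conditionals($pasted), "\n"; // ABC
```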
