I have an application that needs to safely display foreign HTML data (e.g. HTML-encoded email texts, though not only those) - i.e., remove XSS attempts and other nasty stuff, while still displaying the HTML roughly the way it is meant to look. The solutions considered so far aren't ideal:
Clean the HTML with something like HTMLPurifier. This works fine, but once the email size goes over 100K it becomes very slow - tens of seconds per email. I suspect any sufficiently secure parser would be just as slow in PHP; some emails are really bad HTML, and I've seen ones that generate 150K of HTML for a single page of text.
Display the HTML in an iframe - the problem here is that, AFAIK, the iframe then needs to be on a different origin to be safe from XSS, which would require a second domain for the same app. Setting up an application with two domains is much more work and may be very hard in some setups (such as hosting that gives you only one domain name).
Any other solutions that can achieve this result?
From my understanding, I don't believe so.
The trouble is that you can only safely remove HTML tags if you understand the document's structure, and 'understanding its structure' is exactly what parsing is. Even if you find a different way to analyse the structure of HTML and don't call it parsing, that's still what you're doing, and it's bound to be some form of slow (or unsafe).
What you could do is play around with a few preliminary filters (e.g. strip_tags, which makes a decent preliminary pass even if it's good for nothing else) to give the parser less work to do. Whether that's viable depends on the size of your tag whitelist - a small whitelist will probably yield better benchmark results, since a large chunk of the HTML gets filtered out by strip_tags before the parser ever sees it.
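A rough sketch of that kind of pre-filtering (the whitelist below is only illustrative, and the real sanitising still has to happen in the parser afterwards):

// Cheap first pass: discard everything outside a small whitelist so the
// real sanitiser has far less markup to chew through.
$allowed = '<p><br><a><strong><em><ul><ol><li>';
$prefiltered = strip_tags($raw_html, $allowed);

// HTML Purifier (or whichever parser you settle on) still does the real work,
// because strip_tags leaves attributes on the allowed tags untouched.
$clean = $purifier->purify($prefiltered);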
Additionally, different parsers work in different ways, and the sort of HTML you deal with frequently may be best suited to one sort of parser over another - HTML Purifier itself even has different parsers at its disposal that you can switch between to see if that results in a better benchmark for you (though I suspect the differences are negligible).
Whether such juggling works for your use cases is something you'll probably have to benchmark yourself, though.
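For example, switching HTML Purifier's lexer is a one-line config change (a sketch; 'DirectLex' is one of the bundled lexers, alongside the default DOMDocument-based 'DOMLex'):

$config = HTMLPurifier_Config::createDefault();
// Try the alternative lexer; the default is the DOMDocument-based 'DOMLex'.
$config->set('Core.LexerImpl', 'DirectLex');
$purifier = new HTMLPurifier($config);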
Word of caution: I wouldn't go with the iframe approach if you can avoid it. If you don't filter the HTML, you also allow forms, and in combination with scripts and CSS it becomes (IMO) trivial to set up extremely convincing phishing, e.g. with tricks such as "this e-mail is password protected; to proceed, please enter your password".
One possible solution (and the one that SO uses!) is to only allow certain types of tags. <p> and <br /> are fine, but <script> is right out.
Related
I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
1. On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git - but that page was last updated on "2018-02-23", with the label "Whoops, forgot to edit WHATSNEW".
2. The same page as in #1 calls the GitHub link the "Regular old mirror", yet that repository has current (2020) updates... So is that actually the one being used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
3. At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There has never been a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks whether any href begins (case-insensitively) with "javascript:", but the nightmare page shows that you can insert "invisible" tabs, as in "ja vascript:", or add encoded characters and whatnot to break my check and let the "javascript:" href through after all. And there are numerous other tricks which it would simply be impossible for me to sit down and address in my own code.
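To make it concrete, here is a stripped-down, hypothetical version of the kind of check I had, next to one of the evasions from that cheat sheet:

// Hypothetical, simplified form of my old check -- do NOT rely on this.
function href_looks_safe(string $href): bool {
    return stripos(trim($href), 'javascript:') !== 0;
}

// Browsers drop tabs and newlines inside URLs, so this href still runs as
// javascript:, yet it walks straight past the check above.
var_dump(href_looks_safe("java\tscript:alert(1)")); // bool(true) -- unsafe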
Is there really no real_strip_tags or something built into PHP for this crucial and common task?
HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up rebuilding HTML Purifier anyway: it essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
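If it helps, basic usage is only a few lines (a sketch; the whitelist shown here is just an example to adapt):

require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Example whitelist of tags and attributes -- tighten or extend to taste.
$config->set('HTML.Allowed', 'p,br,strong,em,a[href],img[src|alt]');
// Only allow harmless URI schemes, so javascript: and friends are dropped.
$config->set('URI.AllowedSchemes', ['http' => true, 'https' => true, 'mailto' => true]);

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($dirty_html);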
(Sidenote: since this is more a question of best practices, you might get better answers on the Software Engineering Stack Exchange site rather than Stack Overflow.)
I fetch HTML blobs from external sources, for example RSS feeds, and display them in my little localhost web control panel.
For security, I use strip_tags on this content and only allow a small number of "safe" HTML elements, such as <p> and <div>.
Up until today, it has worked as expected.
But today, I got such an HTML blob which had an error: it contained a <div> but no </div>, meaning that it ended up messing with my own webpage.
Simply disallowing <div> is not a solution as it would make the content a mess, and of course they could do the same thing with <p> as well or any other allowed element.
I do have a means to validate HTML strings locally, but it takes like two seconds each time, so it's going to add a lot of extra slow-downs. However, if no other solution can be found, I will do this. (That is, if it's not valid, I just strip all HTML elements and let it become a mess/blob rather than breaking my webpage.)
I should of course have predicted that sooner or later, some external source from where I fetch data would have broken HTML. It's a wonder it didn't happen earlier. Still, there is no obvious solution which preserves some kind of structure/markup and is guaranteed to not mess with my page.
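The closest thing to a cheap fix I can think of (an untested sketch, and it relies on DOMDocument closing unbalanced tags the way a browser would) is to round-trip each blob before embedding it:

// Round-trip the fragment through DOMDocument so unclosed tags get closed
// before the blob lands in my page. This only balances the markup -- it does
// NOT make the content safe on its own.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<meta charset="utf-8">' . $blob);
$balanced = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $balanced .= $doc->saveHTML($child);
}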
I do use an iframe for displaying HTML e-mails, but there are problems with that approach and it basically makes the page annoying to navigate and deal with, so I try to avoid that "solution" for this.
I'm currently working on an email template building website in PHP (LAMP to be specific) that allows users to paste in their HTML email code and then send it off to their customers.
Obviously, when handling this kind of data I need to implement some kind of XSS security. I've scoured the net for weeks trying to find solutions to this and found very few good methods, and they don't really work for full HTML documents (which is what I'd be dealing with).
These are the solutions I found and why they don't work for me:
HTMLPurifier:
I think this is the obvious choice for most because it has the best security and is up to date with industry standards. Although its main use is supposed to be for HTML fragments/small snippets, I thought I'd give it a go.
The first issue I ran into was that the head tags (and anything inside them) were getting stripped and removed. The head is quite essential in HTML emails, so I had to find a way around this... unfortunately, the only fix I could find was to separate the head from the rest of the email and run each part separately through HTMLPurifier.
I've yet to try this because it seems very hacky, but it seems to be the only way to achieve what I'm after. I'm also not sure how good HTMLPurifier is at finding XSS in CSS. On top of all that, it doesn't do well in terms of performance, it being such a large library.
HTMLawed:
HTMLawed seemed to be another great option but a few things swayed me from using it.
A) Compared to HTMLPurifier, this seems to be less secure. HTMLawed has several documented security issues at the moment. It's also not widely used yet, which is more worrying (only about 10 registered companies use it).
B) It's released under the GPL/LGPL license, which effectively means I can't use it on my website unless I'm willing to let people use my service for free.
C) From what I've seen of people talking about it, it seems to strip a lot of tags unless it's heavily configured. I can't say much here because I've not tried it, but that also raises security concerns for me - what if I miss something? What if I can't configure it to keep the elements I want? Etc.
These are my questions to you:
Are there any better alternatives to the ones listed above?
Is it possible to code this myself or is that too ambitious and too insecure?
How do the larger email companies tackle this issue (Mailchimp, ActiveCampaign, Sendinblue, etc.)?
It seems you are sending HTML content, so you cannot filter it; you have to store the HTML in your database as it is. If you filter it for XSS, the HTML will no longer render properly. In any case, all webmail services - GMail, Yahoo, Roundcube etc. - disable JavaScript by default.
If you are using a WYSIWYG editor like CKEditor, it automatically removes all <script> tags and certain unknown attributes, but you can still configure what to accept and what to remove via CKEditor.config().
If your PHP cannot insert the HTML into your database because of some special characters, you can use an SQL prepared statement, or encode the HTML input to base64 with base64_encode() and decode it again when you use it in mail() or PHPMailer::Body.
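For example, a prepared statement handles the special characters for you (a sketch; the table and column names are made up):

// Prepared statements cope with quotes and other special characters, so the
// HTML can be stored untouched. (Table/column names are illustrative only.)
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO email_templates (subject, body_html) VALUES (?, ?)');
$stmt->execute([$subject, $htmlBody]);

// Alternatively, if some layer really chokes on raw HTML, base64 round-trips
// it losslessly.
$stored = base64_encode($htmlBody);
$restored = base64_decode($stored);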
I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of non-breaking spaces, p tags are randomly assigned, they don't follow any rules, and so on...
I'm retrieving data from their website with a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once it is displayed in the Android TextView, the text is formatted all wrong: spread out, uneven and very disorderly.
Also, it's worth mentioning that, for various reasons, I cannot suggest that the company modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
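Something like this, for instance (a sketch, assuming the tidy extension is installed):

// Repair the fetched markup with ext-tidy; tidy_repair_string() returns
// well-formed markup (or false on failure).
$fixed = tidy_repair_string($dirtyHtml, [
    'output-html'    => true,
    'wrap'           => 0,
    'show-body-only' => true,
], 'utf8');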
I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough, you could break it down into meaningful blocks - header, body and unrelated/unimportant stuff - and then apply restyling principles to those. A simpler solution is to just strip all the tags, keep only the text nodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if it isn't too contrived I expect this approach should work.
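A sketch of that strip-everything fallback on the PHP side of your web service (so the Android app only ever receives plain text):

// Parse the scraped markup, drop script/style content, and keep only the
// text nodes, collapsing whitespace so the result reads as flowing text.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<meta charset="utf-8">' . $scrapedHtml);
foreach (['script', 'style'] as $tag) {
    while ($node = $doc->getElementsByTagName($tag)->item(0)) {
        $node->parentNode->removeChild($node);
    }
}
$text = $doc->getElementsByTagName('body')->item(0)->textContent;
$text = trim(preg_replace('/\s+/u', ' ', $text));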
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means each <span> will likely be on its own line; it will seem to force a line break. Detecting these situations isn't impossible, but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with PHP (that was really easy to do) and then displaying the retrieved strings in formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.
I'm making some pages using HTML, CSS, JavaScript and PHP. How much time should I be putting into validating my pages using W3C's tools? It seems everything I do works but provides more errors according to the validations.
Am I wasting my time using them? If so, are there any other suggestions for making sure my pages are valid, or is 'if it works, it's fine' good enough?
You should validate your code frequently during development.
E.g. after each part/chunk, revalidate it and fix errors immediately.
That approach speeds up your learning, since you instantly learn from small mistakes and avoid a pile of debugging afterwards that would only confuse you even more.
A valid page guarantees that you don't have to struggle with invalid code AND cross-browser issues at the same time ;)
Use:
Markup Validation Service for HTML
CSS Validation Service for CSS
JSLint for JavaScript
jQuery Lint for jQuery
Avoid writing new code until all other code is debugged and fixed
If you have a schedule with a lot of bugs remaining to be fixed, the schedule is unreliable. But if you've fixed all the known bugs, and all that's left is new code, then your schedule will be stunningly more accurate. (Joel Test, rule number 5)
Yes, you should definitely spend some time validating your HTML and CSS, especially if you're just starting learning HTML/CSS.
You can reduce the time you spend on validating by writing some simple scripts that validate the HTML/CSS automatically, so you get immediate feedback and can fix the problems easily rather than stacking them up for later.
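As a rough example of such a script (this posts a page to the W3C Nu HTML Checker's JSON interface; the endpoint and response fields are as documented on validator.w3.org, so double-check them before relying on this):

// Send a local file to the W3C Nu HTML Checker and print its messages.
$ch = curl_init('https://validator.w3.org/nu/?out=json');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => file_get_contents('page.html'),
    CURLOPT_HTTPHEADER     => ['Content-Type: text/html; charset=utf-8'],
    CURLOPT_USERAGENT      => 'local-validation-script',
    CURLOPT_RETURNTRANSFER => true,
]);
$report = json_decode(curl_exec($ch), true);
curl_close($ch);

foreach ($report['messages'] as $msg) {
    echo strtoupper($msg['type']) . ': ' . $msg['message'] . PHP_EOL;
}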
Later on, though, when you're more familiar with what is and what is not valid HTML and CSS, you can practically write a lot without raising any errors (or just one or two minor ones). At that stage you can be more relaxed and don't need to be so anxious about checking every time, since you know it will pass anyway.
A big no-no: do not let the errors stack up, or you'll be overwhelmed and never get the code valid.
Do you necessarily need to get to zero errors at all times? No.
But you do need to understand the errors that come out of the validator, so you can decide if they're trivial or serious. In doing so you will learn more about authoring HTML and in general produce better code that works more compatibly across current and future browsers.
Ignoring an error about an unescaped & in a URL attribute value is all very well, until a parameter name following it happens to match a defined entity name in some browser now or in the future, and your app breaks. Keep the errors down to ones you recognise as harmless and you will have a more robust site.
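A cheap way to avoid that particular error in PHP-generated markup is to escape the URL as you echo it into the attribute (a small sketch):

// Escape the & (and quotes) whenever a URL goes into an attribute, so
// parameter names like reg or copy can never be mistaken for entities.
$url = 'results.php?search=foo&reg=1';
echo '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '">results</a>';
// outputs: <a href="results.php?search=foo&amp;reg=1">results</a>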
For example let's have a look at this very page.
Line 626, Column 64: there is no attribute "TARGET"
… get an OpenID
That's a simple, common and harmless one. StackOverflow is using HTML 4 Strict, which controversially eliminates the target attribute. Technically SO should move to Transitional (where it's still allowed) or HTML5 (which brings it back). On the other hand, if everything else SO uses is Strict, we can just ignore this error as something we know we're doing wrong, without missing errors for any other non-Strict features we've used accidentally.
(Personally I'd prefer the option of getting rid of the attribute completely. I dislike links that open in a new window without me asking for it.)
Line 731, Column 72: end tag for "SCRIPT" omitted, but its declaration does not permit this
document.write('<div class=\"hireme\" style=\"' + divstyle + '\"></div>');
This is another common error. In SGML-based HTML, a CDATA element like <script> or <style> ends on the first </ (ETAGO) sequence, not just if it's part of a </script> close-tag. So </div> trips the parser up and results in all the subsequent validation errors. This is common for validation errors: a basic parsing error can easily result in many errors that don't really make sense after the first one. So clean up your errors from the start of the list.
In reality browsers don't care about </ and only actually close the element on a full </script> close-tag, so unless you need to document.write('</script>') you're OK. However, it is still technically wrong, and it stops you validating the rest of the document, so the best thing to do is fix it: replace </ with <\/, which, in a JavaScript string literal, is the same thing.
(XHTML has different rules here. If you use an explicit <![CDATA[ section—which you will need to if you are using < or & characters—it doesn't matter about </.)
This particular error is in advertiser code (albeit the advertiser is careers.SO!). It's very common to get errors in advertiser code, and you may not even be allowed to fix it, sadly. In that case, validate before you add the broken third-party markup.
I think you hit a point of diminishing returns, certainly. To be honest, I look more closely and spend more effort on testing across browser versions than on trying to eliminate every possible error/warning from the W3C validators; though I usually try to have no errors, and only warnings related to hacks I had to use to get all browsers to work.
I think it's especially important to at least try, though, when you are using CSS and JavaScript in particular. Both CSS and JavaScript require things to be more 'proper' for them to work correctly. Having well-formed (X)HTML always helps there, IMO.
Valid (X)HTML is really important.
Not only does it teach you more about HTML and how namespaces work, it also makes the page easier for the browser to render and provides a more stable DOM for JavaScript.
There are a few factors among the pros and cons of validation; the most important con is speed, since keeping your documents valid usually takes more markup, which makes your pages slightly larger.
Companies like Google do not validate because of their mission statement of being the fastest search engine in the world, but just because they don't validate doesn't mean they don't encourage it.
A pro of validating your HTML is doing your bit for the WWW, in the sense of better code = faster web.
Here's a link with several other reasons why you should validate:
http://validator.w3.org/docs/why.html
I always validate my sites so that they're 100% and I am happy with the reliability of the documents.
You're not wasting your time, but I'd say don't fuss over minor warnings if you can't avoid them.
If your scope is narrow (modern desktop web browsers only: IE, Firefox, Chrome, Opera), then just test your site on all of them to make sure it displays and operates properly on them.
It will pay off in the future if your code validates fully, mostly by avoiding obvious pitfalls. For example, if IE doesn't think a page is fully valid then it goes into quirks mode, which makes some elements display exactly like that: quirky.
Although W3C writes the official specifications for HTML, CSS, etc, the reality is that you can't treat them as the absolute authority on what you should do. That's because no browser perfectly implements the specifications. Some of the work you would need to do to get 100% valid documents (according to the W3C tools) will have no effect on any major browser, so it is questionable whether it is worth your time.
That's not to say it isn't important to validate pages — it is important to use the tools, especially when you are learning HTML — but it is not important (in my opinion) to achieve 100%. You will learn in time which ones matter and which ones you can ignore.
There is a strong argument that what matters is what browsers actually do. While there are many good arguments for standards, ultimately your job is to create pages that actually work in browsers.
The book Dive into HTML 5 has an excellent chapter on the tension between standards and reality in HTML. It's lengthy but a very enjoyable read! http://diveintohtml5.ep.io/past.html