PHP - Securing a user's full HTML file against XSS

I'm currently working on an email template building website in PHP (LAMP to be specific) that allows users to paste in their HTML email code and then send it off to their customers.
Obviously, handling this kind of data means I need to implement some kind of XSS security. I've scoured the net for weeks trying to find solutions to this and found very few good methods, and they don't really work for full HTML documents (which is what I'd be dealing with).
These are the solutions I found and why they don't work for me:
HTMLPurifier:
I think this is the obvious choice for most because it has the best security and is up to date with industry standards. Although its main use is supposed to be for HTML fragments/small snippets, I thought I'd give it a go.
The first issue I ran into was that the head tags (and anything inside them) were getting stripped. The head is quite essential in HTML emails so I had to find a way around this... unfortunately, the only fix I could find was to separate the head from the rest of the email and run each part separately through HTMLPurifier.
I've yet to try this because it seems very hacky, but it seems to be the only way to achieve what I'm after. I'm also not sure how good HTMLPurifier is at finding XSS in CSS. On top of all that, it doesn't do well in terms of performance, being such a large library.
HTMLawed:
HTMLawed seemed to be another great option but a few things swayed me from using it.
A) Compared to HTMLPurifier, this seems to be less secure. HTMLawed has several documented security issues at the moment. It's also not widely used yet which is more worrying (only used by about 10 registered companies).
B) It's released under the GNU GPL license, which, as I understand it, effectively means I can't use it on my website unless I'm willing to let people use my service for free.
C) From what I've seen of people talking about it, it seems to strip a lot of tags unless it's heavily configured. I can't have much say here because I've not tried it but that also raises security concerns for me - what if I miss something? what if I can't configure it to keep the elements I want? etc.
These are my questions to you:
Are there any better alternatives to the ones listed above?
Is it possible to code this myself or is that too ambitious and too insecure?
How do the larger email companies tackle this issue (mailchimp, activecampaign, sendinblue, etc.)?

It seems you are sending HTML content, so you cannot simply filter it: you must store the HTML in your database as-is. If you run it through an XSS filter, the HTML will no longer work properly. Webmail services such as Gmail, Yahoo and Roundcube disable JavaScript by default anyway.
If you are using a WYSIWYG editor like CKEditor, it automatically removes all <script> tags and certain unknown attributes. You can still configure what it accepts and what it removes via CKEditor.config().
If your PHP code cannot insert the HTML into your database because of special characters, you can use a SQL prepared statement, or encode the HTML input with base64_encode() and decode it again when you use it in mail() or PHPMailer::Body.
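As a minimal sketch of the prepared-statement route, assuming PDO with an in-memory SQLite database and a hypothetical `templates` table (neither comes from the question), the HTML can be stored and read back byte for byte without any manual escaping:

```php
<?php
// Hypothetical schema for illustration: one table holding the raw HTML.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE templates (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');

// Store the HTML through a bound parameter; quotes need no manual escaping.
function saveTemplate(PDO $pdo, string $html): int
{
    $stmt = $pdo->prepare('INSERT INTO templates (body) VALUES (:body)');
    $stmt->execute([':body' => $html]);
    return (int) $pdo->lastInsertId();
}

// Read the HTML back, or null when the id does not exist.
function loadTemplate(PDO $pdo, int $id): ?string
{
    $stmt = $pdo->prepare('SELECT body FROM templates WHERE id = :id');
    $stmt->execute([':id' => $id]);
    $body = $stmt->fetchColumn();
    return $body === false ? null : $body;
}

$html   = "<p class=\"x\">It's 100% \"quoted\" &amp; stored as-is</p>";
$id     = saveTemplate($pdo, $html);
$stored = loadTemplate($pdo, $id);
```

Note that a prepared statement only protects the database query. The stored HTML is unchanged, so it is exactly as trusted (or untrusted) as it was when the user pasted it.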

Related

Is "HTML Purifier" really trustworthy? Is there a better way to secure an untrusted/unsafe HTML code string in PHP?

I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
1. On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git
   But that page was last updated in "2018-02-23", with the label "Whoops, forgot to edit WHATSNEW".
2. The same page as in #1 calls the Github link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
3. At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?

HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
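To make the parse / whitelist / reassemble idea concrete, here is a rough sketch using PHP's DOM extension. The tag list, the scheme list and every function name are assumptions made up for illustration; this is nowhere near as thorough as HTML Purifier and is not a replacement for it.

```php
<?php
// Illustrative whitelist only: tag name => allowed attributes.
const ALLOWED_TAGS    = ['p' => [], 'br' => [], 'b' => [], 'i' => [], 'a' => ['href']];
const ALLOWED_SCHEMES = ['http', 'https', 'mailto'];

function isSafeUrl(string $url): bool
{
    // Remove control characters and spaces, which can be used to split up a scheme name.
    $clean = preg_replace('/[\x00-\x20]+/', '', $url);
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        return false; // this sketch also rejects relative URLs
    }
    return in_array(strtolower($m[1]), ALLOWED_SCHEMES, true);
}

function sanitizeNode(DOMNode $node): void
{
    // Copy the child list first, because the tree is modified while walking it.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($child instanceof DOMElement) {
            $tag = strtolower($child->nodeName);
            if (!array_key_exists($tag, ALLOWED_TAGS)) {
                $node->removeChild($child); // drop the element and its subtree
                continue;
            }
            foreach (iterator_to_array($child->attributes, false) as $attr) {
                $name = strtolower($attr->nodeName);
                $ok = in_array($name, ALLOWED_TAGS[$tag], true)
                    && ($name !== 'href' || isSafeUrl($attr->value));
                if (!$ok) {
                    $child->removeAttributeNode($attr);
                }
            }
            sanitizeNode($child);
        } elseif (!($child instanceof DOMText)) {
            $node->removeChild($child); // comments, processing instructions, ...
        }
    }
}

function sanitizeFragment(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // the prefix hints UTF-8 to libxml
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    sanitizeNode($body);

    // Reassemble the surviving nodes into an HTML string.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$dirty = '<p onclick="evil()">Hello</p><script>alert(1)</script>'
       . '<a href="jav&#x09;ascript:alert(1)">x</a>'
       . '<a href="https://example.com/">ok</a>';
$clean = sanitizeFragment($dirty);
```

The important design choice is that the URL check runs on the attribute value after the parser has decoded entities, and only accepts an explicit list of schemes instead of looking for "javascript:". Even so, CSS, SVG, srcset and many other vectors are not handled here, which is why ending up with an established library is the likely outcome.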
(Sidenote, since this is more of a question of best practises, you might get better answers on the Software Engineering Stack Exchange site rather than Stackoverflow.)

Clean HTML from Word documents

Ok, so my company has a client that has an interface for posting content - standard MySQL database, PHP-based, etc.
Anyway, they've continually had an intern or someone post content to this interface straight from an MS Word doc. The interface is coded poorly and takes this input as is, with no formatting.
My company has now been contracted out to fix this particular problem, as it is continually breaking their site, and my company has repeatedly had to manually go into the database, and delete the offending values.
Is there a quick and easy way to do this, or am I going to have to just do a replace operation on each offending character?
I see htmlentities() may be a partial solution - but as far as I know, that won't remove everything.
What's a good solution to this problem? Is there anything out there to make this easier?
We're also considering writing a content validator, probably just server-side (though maybe client-side, if my week is going slowly enough/I finish the rest of this quickly enough).

It depends on how many clients (or potential clients) you are supporting and how much time you have to invest. Options:
Write your own function to strip out the metadata.
Teach your clients to remove it themselves, for example by pasting into Notepad first, or supply a knowledge base article to explain how to do it in the software. Perhaps a "Help" section or icon they can click on.
http://support.microsoft.com/default.aspx?scid=kb;en-us;223396
Use a WYSIWYG editor such as TinyMCE, which has built-in functionality to remove it.
But like I said in the comments, unless you are using your own function, prepare for clients to continue to paste directly and wonder why there is a problem.
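For the "write your own function" option, a rough sketch could look like the following. The function name and the exact set of patterns are assumptions; Word's output varies between versions, so treat this as a starting point to test against real pasted content, not a complete cleaner.

```php
<?php
// Hypothetical helper: strips common Word paste artifacts from UTF-8 input.
function cleanWordPaste(string $text): string
{
    // Remove HTML comments, including Word's <!--[if gte mso 9]>...<![endif]--> blocks.
    $text = preg_replace('/<!--.*?-->/s', '', $text);
    // Remove bare conditional markers such as <![if !supportLists]> and <![endif]>.
    $text = preg_replace('/<!\[(?:if|endif)[^>]*>/i', '', $text);
    // Remove Office namespaced tags such as <o:p> or <w:WordDocument>, keeping inner text.
    $text = preg_replace('/<\/?[a-z]+:[^>]*>/i', '', $text);
    // Drop class="Mso..." and inline style attributes (this removes ALL inline styling).
    $text = preg_replace('/\s+(?:class|style)\s*=\s*(?:"[^"]*"|\'[^\']*\')/i', '', $text);
    // Map typographic characters to plain ASCII equivalents.
    return strtr($text, [
        "\u{2018}" => "'", "\u{2019}" => "'",   // single curly quotes
        "\u{201C}" => '"', "\u{201D}" => '"',   // double curly quotes
        "\u{2013}" => '-', "\u{2014}" => '-',   // en and em dashes
        "\u{2026}" => '...',                    // ellipsis
        "\u{00A0}" => ' ',                      // non-breaking space
    ]);
}

$pasted  = "<p class=\"MsoNormal\" style='mso-bidi:x'>\u{201C}Hi\u{201D}<o:p></o:p></p>"
         . "<!--[if gte mso 9]><xml>junk</xml><![endif]-->";
$cleaned = cleanWordPaste($pasted);
```

This only tidies the markup. It assumes the input is already UTF-8, and it is not a security filter, so htmlentities() or a real sanitizer is still needed at output time.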

Displaying foreign HTML safely

I have an application that needs to display foreign HTML data (e.g. HTML-encoded email texts, though not only) safely - i.e., remove XSS attempts and other nasty stuff. But still be able to display HTML as it should look like. Solutions considered so far aren't ideal:
Clean HTML with something like HTMLPurifier. Works fine, but once email size goes over 100K it becomes very slow - tens of seconds per email. I suspect any secure enough parser would be as slow in PHP - some emails are really bad HTML, I've seen some that generate 150K HTML for one page of text.
Display HTML in an iframe. The problem here is that the iframe then needs to be on another origin to be safe from XSS AFAIK, and this would require a different domain for the same app. Setting up an application with two domains is much more work and may be very hard in some setups (such as hosting that gives only one domain name).
Any other solutions that can achieve this result?

From my understanding, I don't believe so.
The trouble is that you can only safely remove HTML tags if you understand the document's structure, and 'understanding its structure' is exactly what parsing is. Even if you find a different way to analyse the structure of HTML and don't call it parsing, that's what you're doing, and it's bound to be some form of slow (or unsafe).
What you could do is play around with a few preliminary filters (e.g. strip_tags, which is generally a good preliminary step, if nothing else) to give the parser less work to do. Whether that's viable depends on the size of your tag whitelist: a small whitelist will probably yield better benchmark results, since a large chunk of HTML would be filtered out by strip_tags before the parser got to it.
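A small sketch of such a pre-filter, with a made-up whitelist, might look like this:

```php
<?php
// Hypothetical whitelist; strip_tags() accepts the allowed tags as one string.
const PREFILTER_TAGS = '<p><br><a><b><i><ul><ol><li>';

// Cheap first pass: drop every tag that is not on the whitelist.
function prefilter(string $html): string
{
    return strip_tags($html, PREFILTER_TAGS);
}

$dirty   = '<div><p onclick="x()">Hi <font>there</font></p><script>alert(1)</script></div>';
$reduced = prefilter($dirty);
```

Two things are worth checking when benchmarking this: strip_tags() leaves the attributes of allowed tags untouched (the onclick above survives), and it removes tags but not the text between them. So the result is only smaller input for the real parser, not sanitized output.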
Additionally, different parsers work in different ways, and the sort of HTML you deal with frequently may be best suited to one sort of parser over another - HTML Purifier itself even has different parsers at its disposal that you can switch between to see if that results in a better benchmark for you (though I suspect the differences are negligible).
Whether such juggling works for your usecases is something you'll probably have to benchmark yourself, though.
Word of caution: If you decide to pursue it, know I wouldn't go with the iframe approach. If you don't filter HTML, you also allow forms, and it becomes (IMO) trivial in combination with scripts and CSS to set up extremely convincing phishing, e.g. using tricks such as "this e-mail is password protected, to proceed, please enter your password".

One possible solution (and the one that SO uses!) is to only allow certain types of tags. <p> and <br /> are fine, but <script> is right out.

PHP displaying html email in a html page

I'm building a PHP email mailbox script.
How would I make HTML emails display cleanly as they do in Gmail/Hotmail?
If I just echo it out it affects the whole page layout.
I could use iframes but surely that isn't the best solution.

If you are looking for the 'best solution' get on board with another open source email library that is doing the same thing you are. Maintaining an email renderer on your own that is safe against script injection and other hacks will simply be too much work for one person.
One example: https://github.com/afterlogic/webmail-lite
Another: http://trac.roundcube.net/
You get the benefit of other developers who use the library maintaining the code base, so if something is broken, all you have to do is pull the latest update (hopefully) and you get the fix. If you find something that needs improving, you can fix it or build it, and make the code better for everyone. I'm really just pitching open source libraries here, however in any commercial context, building your own email renderer without a big team, is a bad idea.

As Marc B stated, I believe an IFrame would be your best bet... but please realize that if you just dump any email HTML code you risk exposing yourself to viruses, Trojans, and malicious HTML/JavaScript code. You're opening Pandora's box on your computer unless you find a good way to sandbox/strip that HTML.
Here's a simple Regex to clean JavaScript at least :
"(?s)<script.*?(/>|</script>)"
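For the sandboxing half of that advice, one option worth evaluating is the HTML sandbox attribute on the iframe, combined with srcdoc so that no second URL is needed. The function name and the inline sizing below are assumptions for illustration:

```php
<?php
// Hypothetical helper: wraps an email body in a fully sandboxed iframe.
function renderSandboxedEmail(string $emailHtml): string
{
    // Escape the whole document so it can sit inside the srcdoc attribute.
    $srcdoc = htmlspecialchars($emailHtml, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    // An empty sandbox attribute applies every restriction: no scripts,
    // no form submission, and the content gets its own opaque origin.
    return '<iframe sandbox srcdoc="' . $srcdoc . '"'
         . ' style="width:100%;height:600px;border:0"></iframe>';
}

$frame = renderSandboxedEmail('<p title="a">Hi</p><script>alert(1)</script>');
```

The browser decodes the attribute and renders the email inside the frame, so the surrounding page layout is not affected. This relies on browser support for sandbox and srcdoc, and links and remote images inside the email still work, so stripping or sanitizing the HTML can still be worthwhile on top of it.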

Consider the use of some HTML Tidy library (e.g. PHP's Tidy extension).
You can pass the text through the library to get well formatted html.
A good practice would be to define a CSS standard behaviour for most tags in the div you're using.

Create a DIV container that you assign width (and height if needed) to, and make sure you add an overflow property to match your design. This should keep your email HTML from interfering with your layout.
UPDATE
A DIV container still assures you that you can constrain the size of the display box and with appropriate CSS acts similar to an iframe without all the baggage.
If you are worried about the code in the email, strip_tags would seem a better solution than the regex. You can define a list of tags to leave alone and still be confident of stripping the rest.

Xhtml instead of Php?

I want to develop a site that will allow me to publish information to users, and give them an opportunity to subscribe to a mailing list so they can be updated each time I make a change to the site.
*Add new information, etc.
I also would like for the users to be able to add comments about reviews posted, and give me suggestions...Things that will encourage user interaction
I understand that this is possible with php...
But I do not know php, and to learn and test it I apparently need a domain to begin with...etc.
Is it possible that I use Xhtml/Html to get the same results?
--
I know I can use a mailto "Mail" link, but that would also leave my email open to spam... Any suggestions?
And I do apologize if this question has been posted before, I did some research and found no such thing.
All helpful responses are appreciated.

XHTML and HTML are essentially the same thing; XHTML is just based on an XML standard (that's where the X comes from) and is therefore a bit stricter.
HTML/XHTML is generally used for the structure of your webpage, whereas PHP is a server-side language, meaning it works behind the scenes.
You could use HTML, but it'd be hideously complex to make, so I'd say you'd be better off biting the bullet and making a start on your first PHP app :) Don't worry, it's very easy to get your head around. You do not need a domain to get started with the development: simply install WAMP (for Windows) or MAMP (if you're an Apple freak like me). These programs act as self-contained mini servers, very useful for development!
Then I'd suggest trying it all out using HTML for starters, just so you get used to the WAMP/MAMP server, before heading over to http://devzone.zend.com/article/627 for a brilliant set of tutorials on PHP!
EDIT: Another poster mentioned WordPress; it's a great platform too! But I always favour learning the basics, so in the event of something going wrong or not working the way you want it to, you'll know what to do, or at least have an idea. Therefore I'd stick with your own PHP solution as a starter, then progress to WordPress when you feel comfortable.
I hope this helps :)

(X)HTML is the markup language that's interpreted by the browser, to display your web pages.
PHP is a language, used on the server, that can :
Generate that HTML markup
Act as a 'glue' with other systems, such as a database, for data persistence.
(X)HTML by itself is not dynamic : it's only used to display data.
And PHP by itself doesn't display much information : it generates it.
So, basically, you'll need to use both (X)HTML and PHP :
PHP for everything that's dynamic
like interaction with a database, a form, ...
HTML (possibly generated by the PHP code) to display the data.
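As a tiny illustration of that split, here is a sketch in which PHP holds the dynamic part (a hard-coded, made-up list of comments standing in for a database query) and produces the HTML that the browser displays:

```php
<?php
// Hypothetical data; on a real site this would come from a database.
$comments = [
    ['author' => 'Ann', 'text' => 'Great review!'],
    ['author' => 'Bob', 'text' => 'I <3 this site'],
];

// Build an HTML list from the data, escaping user-supplied text on output.
function renderComments(array $comments): string
{
    $html = "<ul>\n";
    foreach ($comments as $c) {
        $html .= sprintf(
            "  <li><strong>%s</strong>: %s</li>\n",
            htmlspecialchars($c['author'], ENT_QUOTES, 'UTF-8'),
            htmlspecialchars($c['text'], ENT_QUOTES, 'UTF-8')
        );
    }
    return $html . "</ul>\n";
}

echo renderComments($comments);
```

The browser only ever sees the generated markup; the loop, the data source and the escaping all stay on the server.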

No, you will need some kind of server side scripting language to be able to interrogate a database, print out comments and send the generated HTML to the browser.
If you don't know how to use PHP, how about using an open source solution like WordPress? It is a blogging platform, but it offers all the things you listed.

I would suggest using WordPress because:
It is easy to learn, the documentation is excellent
There are thousands of free plugins to add functionality to your site
There is a plugin, Contact Form 7, that will allow your users to send you email while doing a good job of curbing spam
There is a built in RSS feed to push out to your users notices when your site is updated
WordPress can be installed on shared hosting, virtual private hosts, and almost any machine with the LAMP stack
If you are new to creating websites, WordPress has free themes which are a good starting place
Finally, to answer your question, XHTML and PHP do different things. XHTML is like the idea of a picture. You can see it, it has shapes, outlines, sometimes words, etc. PHP, on the other hand, is like film, where viewers can see something, but there is something in the background that is updating and moving.

HTML is just a markup language used by the browser to format data to display to users.
Most hosting solutions provide form mailer scripts that just take an HTML form and email the fields to a specified email address which you can configure.
They also provide mailing list functionality.
So, maybe check for a (PHP) hosting solution that provides this functionality, and you won't need to write any PHP until you require more complex, custom functionality.
