Using HTML Purifier to stop links to own site - php

I have used HTML purifier to weed out any suspect stuff coming in from my public facing WYSIWYG editor. The incoming HTML is also displayed in the public portion of the website.
I have allowed links, and I also automatically linkify URLs in plain text (using the purifier).
Is there a way to allow external links, but ban links to the same domain? E.g. my domain is www.example.com:
http://www.google.com will be linked.
http://www.example.com/logout/ will not be linked.
I am looking at minimizing any interference from malicious users. Should I just make my logout link a form action with a POST key/value pair to stop this from happening?
Thanks

Your login/out form should ALWAYS be POST-only.
Don't worry about a verification value, but this is a pretty important security issue: any transaction that changes the state of the web server should be a POST request. You should NEVER allow http://example.com/object?action=delete, or any variant thereof. PHP encourages bad practice in this matter (for example, $_REQUEST merges GET and POST parameters), but you should ALWAYS read from one or the other, and NEVER allow both.
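A minimal sketch of the POST-only rule, assuming a dedicated logout script. The helper name isLogoutRequestAllowed() and the logout.php usage in the comments are hypothetical and only illustrate the idea:

```php
<?php
// Hypothetical helper: decide whether a logout request should be honored.
// Only POST is accepted, so a plain link or an <img src> pointing at the
// logout URL (both of which trigger a GET) cannot log the user out.
function isLogoutRequestAllowed(array $server): bool
{
    $method = strtoupper($server['REQUEST_METHOD'] ?? '');
    return $method === 'POST';
}

// Possible use in a hypothetical logout.php:
// if (!isLogoutRequestAllowed($_SERVER)) {
//     http_response_code(405);   // Method Not Allowed
//     header('Allow: POST');
//     exit;
// }
// session_destroy();
```

Passing $_SERVER in as an argument keeps the check easy to test without a real HTTP request.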
If your users can write forms into your WYSIWYG editor, you've got far bigger problems than this.
To answer your original question, to disable internal links, use URI.HostBlacklist and be sure to set URI.MakeAbsolute:
http://htmlpurifier.org/live/configdoc/plain.html#URI.HostBlacklist
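As a rough configuration sketch (not a drop-in solution), the directives named above could be combined like this. The Composer autoload path, the sample input, and the choice of 'example.com' as the blacklisted string are assumptions. As far as I know, URI.MakeAbsolute only works when URI.Base is also set, and URI.HostBlacklist matches substrings of the host, so a short entry may block more hosts than intended:

```php
<?php
// Assumes HTML Purifier was installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Resolve relative links such as "/logout/" against the site's own URL
// first, so they carry a host that the blacklist can match.
$config->set('URI.Base', 'http://www.example.com/');
$config->set('URI.MakeAbsolute', true);

// Drop any URI whose host contains one of these strings.
$config->set('URI.HostBlacklist', array('example.com'));

// Turn plain-text URLs into links, as described in the question.
$config->set('AutoFormat.Linkify', true);

$purifier = new HTMLPurifier($config);

// Hypothetical input as it might arrive from the WYSIWYG editor.
$dirtyHtml = '<a href="/logout/">bye</a> <a href="http://www.google.com">search</a>';
$cleanHtml = $purifier->purify($dirtyHtml);
```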

Related

PHP Securely view untrusted email content in browser

What would be the recommended way to securely view emails in a browser (in PHP)?
Emails are highly insecure content, and desktop email software obviously implements only a very limited subset of HTML and no JavaScript at all to prevent attacks. But if I took an email's HTML source and displayed it in a browser, JavaScript code and other content would be executed.
I thought a solution would be to send a header like this along with the email source:
header("Content-Security-Policy: sandbox");
But this would prevent me from fetching inline images from the server, as I would still need a PHP session ID to be transmitted to verify that the user is allowed to fetch this content.
As there are many web email clients out there, I wonder if there is a best-practice model.
(FYI: I am trying to implement my own webmail tool tailored to the specific needs of a larger software suite.)
You can address the issue of images by not requiring authentication and then making the URLs hard to guess (ex: <img src="/resources/SomeReallyLongHardToGuessRandomString">).
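A small sketch of generating such a hard-to-guess URL, assuming PHP 7 or later for random_bytes(). The helper names are made up for illustration, and the mapping from token to image file (for example a database row) is left out:

```php
<?php
// Hypothetical helper: create an unguessable token for one inline image.
function makeImageToken(int $bytes = 32): string
{
    // random_bytes() uses the operating system's CSPRNG;
    // bin2hex() turns N bytes into 2N lowercase hex characters.
    return bin2hex(random_bytes($bytes));
}

// Hypothetical helper: "/resources/" mirrors the path in the example above.
function makeImageUrl(string $token): string
{
    return '/resources/' . $token;
}
```

The token only protects the image as long as it is never exposed elsewhere, so it is probably wise to let tokens expire.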
More broadly though, securely displaying user generated HTML is hard. Like really hard. This is a case where you should use a library. Keep in mind that you might have a user with a browser that is too old for the Content-Security-Policy header. This browser would happily run any scripts on the page. HTML Purifier is my personal choice, but there are others. Also, keep in mind that this is a dependency you will want to update often as people are constantly discovering new bugs.
As an additional line of defense, many sites use a separate domain for user-generated content. For example, Google uses googleusercontent.com. That way, if something does slip by, it hasn't compromised the whole application. Note that this would still be bad, as an attacker might be able to read user content they shouldn't be able to (emails in this case).
I finally decided to modify the HTML source of the email (in my PHP script) to serve the inline images as base64-encoded data inside the HTML source, so no additional loading of images is needed:
<img src="...">
This will solve the current problem of displaying emails, because then I can keep my
header("Content-Security-Policy: sandbox");
which is one major way to keep attacks from succeeding. Additionally, for enhanced security, I plan to look again into roundcubemail to find out how they handle this problem, and also to use HTML Purifier to further strip possible threats from the email source.
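For illustration, embedding an image as base64 data could look roughly like the following. The helper names are hypothetical, and the sketch assumes the image bytes and MIME type have already been extracted from the email:

```php
<?php
// Hypothetical helper: turn raw image bytes into a data: URI.
function toDataUri(string $bytes, string $mimeType): string
{
    return 'data:' . $mimeType . ';base64,' . base64_encode($bytes);
}

// Hypothetical helper: build the <img> tag that replaces the original
// reference to the inline attachment.
function inlineImageTag(string $bytes, string $mimeType): string
{
    // Escape the attribute value in case the MIME type contains odd characters.
    $src = htmlspecialchars(toDataUri($bytes, $mimeType), ENT_QUOTES, 'UTF-8');
    return '<img src="' . $src . '">';
}
```

Base64 makes the payload roughly a third larger, so this fits small inline images better than large attachments.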

How to store user content while avoiding XSS vulnerabilities

I know similar questions have been asked but I am struggling to work out how to do it.
I am building a CMS, rather primitive right now, but it's as a learning exercise; in a production site, I would use an existing solution for sure.
I would like to take user input, which can be styled in a WYSIWYG editor. I would also like them to be able to insert images inline.
I understand I can store HTML in the database, but how can I safely re-render it? I know there is no problem with the HTML being stored, but it is my understanding that XSS becomes an issue if I simply dump the user-generated code onto a layout template.
So the question, put simply, is: how can I store and safely re-render user content in a CMS? I am using Laravel and PHP. I also have a little knowledge of JavaScript if it's required.
For a CMS where you want to allow some tags but not others, then you want something like HTML Purifier. This will take HTML and run it against a whitelist and regenerate HTML that is safe to display back to the user.
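A hedged sketch of such a whitelist using HTML Purifier's HTML.Allowed directive. The Composer autoload path, the particular tag list, and the sample input are assumptions chosen for illustration:

```php
<?php
// Assumes HTML Purifier was installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Whitelist: only these elements (and the attributes in brackets) are kept.
$config->set('HTML.Allowed', 'p,br,strong,em,ul,ol,li,a[href],img[src|alt]');

$purifier = new HTMLPurifier($config);

// Hypothetical input as it might arrive from the WYSIWYG editor.
$submitted = '<p>Hello <strong>world</strong></p><script>alert(1)</script>';

// The purified string is what gets stored and later re-rendered.
$safeHtml = $purifier->purify($submitted);
```

In a Laravel Blade template, the purified string would then be printed unescaped with {!! $safeHtml !!}, since the default {{ }} syntax would entity-encode the allowed tags as well.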
A good and cheap way to avoid cross-site scripting is to get your PHP program to entitize everything from your users' input before storing it in the database. That is, you want to take this entry from a user
Hi there sucker! I just hacked your site.
<script>alert('You have been pwned!')</script>
and convert it to this before putting it into your database.
Hi there sucker! I just hacked your site.
&lt;script&gt;alert('You have been pwned!')&lt;/script&gt;
When you pass &lt; to a browser, it renders it as <, but it doesn't do anything else with it.
The htmlentities() function can do this for you, and PHP's html_entity_decode() can reverse it if you need to (htmlspecialchars() and htmlspecialchars_decode() are the equivalent pair for just the HTML special characters). But you shouldn't reverse the operation unless you absolutely must, for example to load the document into an embedded editor for changes.
You can also choose to entitize user-furnished text after you retrieve it from your database and before you display it. If you get to the point where several people work on your code, you may want to do both for safety.
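A short sketch of entitizing at display time. The wrapper name escapeForHtml() is made up; htmlspecialchars() with these flags is PHP's built-in:

```php
<?php
// Hypothetical wrapper around PHP's built-in htmlspecialchars().
function escapeForHtml(string $text): string
{
    // ENT_QUOTES also encodes single quotes; ENT_SUBSTITUTE replaces
    // invalid UTF-8 sequences instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$input = "<script>alert('You have been pwned!')</script>";

// Prints the markup as harmless text instead of letting the browser run it.
echo escapeForHtml($input), "\n";
```

Escaping at output time has the advantage that the database keeps the original text, so the same value can be escaped differently for other contexts.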
You can also render user-provided input inside <pre>content</pre> tags to preserve its whitespace and line breaks. Note, however, that the browser still parses markup inside <pre>, so the content must be entitized first.
(Use right-click Inspect on this very page to see how Stack Overflow handles my malicious example.)

How can I ensure a URL points to safe, non-adult, non-spam content when allowing people to post content to my website?

I am working on a PHP site that allows users to post a listing for their business related to the site's theme. This includes a single link URL, some text, and an optional URL for an image file.
Example:
<img src="http://www.somesite.com" width="40" />
ABC Business
<p>
Some text about how great abc business is...
</p>
The HTML in the text is filtered using the class from htmlpurifier.org and the content is checked for bad words, so I feel pretty good about that part.
The image file URL is always placed inside an <img src="" /> tag with a fixed width and validated to be an actual HTTP URL, so that should be OK.
The dangerous part is the link.
Question:
How can I be sure that the link does not point to some SPAM, unsafe, or porn site (using code)?
I can check headers for 404, etc., but is there a quick and easy way to validate a site's content from a link?
EDIT:
I am using a CAPTCHA and do require registration before posting is allowed.
It's going to be very hard to determine this yourself by scraping the site URLs in question. You'll probably want to rely on some third-party API which can check for you.
http://code.google.com/apis/safebrowsing/
Check out that API: you can send it a URL and it will tell you what it thinks. This one mainly checks for malware and phishing, not so much porn and spam. There are others that do the same thing; just search around on Google.
"Is there a quick and easy way to validate a site's content from a link?"
No. There is no global white/blacklist of URLs which you can use to somehow filter out "bad" sites, especially since your definition of a "bad" site is so unspecific.
Even if you could look at a URL and tell whether the page it points to has bad content, it's trivially easy to disguise a URL these days.
If you really need to prevent this, you should moderate your content. Any automated solution is going to be imperfect and you're going to wind up manually moderating anyways.
Manual moderation, perhaps. I can't think of any way to automate this other than using some sort of blacklist, but even then that is not always reliable as newer sites might not be on the list.
Additionally, you could try using cURL and downloading the index page and looking for certain keywords that would raise a red flag, and then perhaps hold those for manual validation.
I would suggest keeping a list of these keywords in an array (porn, sex, etc.). If the index page that you downloaded with cURL contains any of those keywords, reject it or flag it for moderation.
This is neither reliable nor the most efficient way of approving links.
Ultimately, you should have manual moderation regardless, but if you wish to automate it, this is a possible route for you to take.
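The cURL-plus-keyword idea above could be sketched roughly as follows. Both function names are hypothetical, the timeout and redirect limits are arbitrary choices, and whole-word matching is an assumption added here to cut down on false positives:

```php
<?php
// Hypothetical helper: download a page body with cURL; returns null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 10,    // seconds
        CURLOPT_PROTOCOLS      => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Return the keywords that occur as whole words in the page's text.
function findFlaggedKeywords(string $html, array $keywords): array
{
    $text = strtolower(strip_tags($html));
    $hits = [];
    foreach ($keywords as $keyword) {
        $pattern = '/\b' . preg_quote(strtolower($keyword), '/') . '\b/';
        if (preg_match($pattern, $text) === 1) {
            $hits[] = $keyword;
        }
    }
    return $hits;
}

// Possible use: hold the listing for manual review when anything matches.
// $html = fetchPage($submittedUrl);
// $needsReview = $html === null || findFlaggedKeywords($html, ['porn', 'sex']) !== [];
```

A failed download is treated as "needs review" here rather than as a pass, since an unreachable page cannot be checked at all.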
You can create a small moderation system that puts user-created content into an approval queue that only administrators can access, so they can approve the content that should be displayed on the site.

Is it ok not to clean user input in this situation (PHP/MySQL)?

I always run user-supplied input through both the htmlentities() and mysql_real_escape_string() functions.
But now I am building a CMS which has a WYSIWYG editor in the admin section. I noticed that using htmlentities() on the WYSIWYG edited user content removed all styles and throws a bunch of quotes on the front end article page (as can be expected).
So is it OK not to clean the HTML/JavaScript entered by the user in this situation? I will still use mysql_real_escape_string(), which doesn't conflict.
Although the admin is the only one who will have access to the back end, I can think of at least one scenario: suppose a hacker somehow got access to the "create a post" page. They could wreak havoc by deleting posts, etc., but instead they choose to use this as an opportunity to send visitors to their site by making this post:
<script>window.location = "http://evilsite.com"</script>
So what should I do? Also, are there any functions that will disable JavaScript but not HTML and inline CSS?
The WYSIWYG editor is TinyMCE, by the way.
It is never OK to not clean user input. Anybody can sabotage your system, just like you hypothesized. This kind of risk is simply not worth taking.
Although, for your case it would depend on the WYSIWYG editor you use. Look around TinyMCE's documentation or ask around, and see what it says about displaying/rendering HTML output in its rich text editor with regards to XSS vulnerabilities.

Taking a hashed URL and sending it to a new URL

For example, I'd like to have my registration, about and contact pages resolve to different content, but via hash tags.
There would be three links, one each to the registration, contact and about page:
www.site.com/index.php#about
www.site.com/index.php#registration
www.site.com/index.php#contact
Is there a way, using JavaScript or PHP, to resolve these pages to the separate content?
The hash is not sent to the server, so you can only do it in JavaScript.
Check the value of location.hash.
There's no server-side way to do it. You could work with AJAX, but this will break the site for non-javascript users. The best way would probably be to have server-side content URLs (index.php?page=<page_id>) and rewrite these locally with JavaScript (to #<page_id>) and handle the content loading with AJAX then. That way you can have your hash-URLs for JS-enabled devices and everybody else can still use the site.
It does, however, require a bit of redundancy, because you need to provide the same content twice: once for inclusion via AJAX and once with the proper layout and everything via PHP.
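The server-side half of that approach (index.php?page=<page_id>) might look something like this sketch. The PAGES map, the content file paths, and resolvePage() are all hypothetical; the page ids are taken from the question:

```php
<?php
// Hypothetical map of page ids to content files.
const PAGES = [
    'about'        => 'content/about.php',
    'registration' => 'content/registration.php',
    'contact'      => 'content/contact.php',
];

// Resolve the ?page= value to a content file. Anything that is not a known
// id (including arrays or path tricks) falls back to the default page.
function resolvePage($pageId, string $default = 'about'): string
{
    if (is_string($pageId) && array_key_exists($pageId, PAGES)) {
        return PAGES[$pageId];
    }
    return PAGES[$default];
}

// Possible use in a hypothetical index.php:
// include resolvePage($_GET['page'] ?? null);
```

Looking the id up in a fixed map, rather than building a file path from the request, keeps users from including arbitrary files.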
If you just want hash URLs for aesthetic reasons, but don't want to rely on JS, you're out of luck. The semantics of URLs are against you: fragment IDs shouldn't really affect the content the URL is referring to, merely the fragment within that content. AJAX URLs are changing those semantics, but there's no good reason to do that if you don't have to.
I suppose you probably have a good reason, but can I ask why you would do this? It breaks the widely understood standard of how hashes in URLs are supposed to work, and it's just begging for trouble with interoperability with other clients down the road.
You can use PHP's $_REQUEST superglobal to read the request parameters, but the hash itself is never sent to the server, so it can only be parsed out if the client passes it along separately...
