I work on a web application that uses Markdown as its syntax. The only issue I am facing is how to validate the user input on the server side so that it is actually Markdown and not some XSS attack injected via a POST request or by disabling JavaScript.
I know Stack Overflow does this, but how do they do it while still allowing certain HTML tags, including images, which are prone to XSS attacks? Is there any open source package that can help (examples appreciated)?
Because I heard that Stack Overflow uses it, I will be trying out Pagedown as a client-side validator.
You need to invest roughly one to two weeks of proper coding to get a tag-soup parser/handler finished that can sanitize the incoming HTML (arriving via Markdown).
I highly suggest a three-pass validation and processing scheme:
Mix-Mode: whitelist the incoming HTML tags that are part of the Markdown document.
Markdown Parser: transform the Markdown into HTML.
HTML-Mode: whitelist the HTML tags that make up the resulting HTML document.
You can then output the result. Store both the Markdown source and the "baked" HTML so you don't need to repeat this for every display operation.
Markdown allows arbitrary HTML to be included in it. Since this includes <script> elements, you can have valid Markdown that is also an XSS attack.
Run the incoming data through a Markdown parser to get HTML, then treat it like any other user submitted HTML (pass it through an HTML parser that applies a whitelist to the elements and attributes).
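For instance, a minimal sketch of that pipeline, assuming the Parsedown and HTML Purifier libraries are installed via Composer (both library choices are assumptions; any Markdown parser plus any whitelisting HTML sanitizer would do):

require 'vendor/autoload.php';

// Untrusted Markdown from the request body.
$markdown = isset($_POST['body']) ? $_POST['body'] : '';

// 1. Markdown -> HTML. Parsedown passes raw inline HTML through untouched by default.
$parsedown = new Parsedown();
$html = $parsedown->text($markdown);

// 2. Treat the result like any other user-submitted HTML:
//    HTML Purifier applies an element/attribute whitelist.
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($html);

echo $clean;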
I'm using html_entity_decode($row['Content']) to display some JSON data that contains HTML in a PHP document. The problem is that some of the data being returned has open HTML tags, such as <strong>, which then carry over into the content displayed after it.
Is there some way to terminate the HTML?
If you ever accept raw HTML from an outside source to embed into your site, you should always, always, reformat and whitelist it. You have no idea what that 3rd party HTML may contain, and you have no guarantee that it's valid; yet on your site you presumably want guaranteed valid HTML with certain limits on its content (or do you really want to enable the embedding of arbitrary <script> tags...?!).
That means you want to:
parse the HTML and extract whatever structural information is in it
filter that structure to allow only approved elements and then
produce your own HTML from that which you can guarantee is syntactically valid.
Supposedly the best PHP library that does this is HTML Purifier. Without using a library, you would use a lenient HTML parser such as DOMDocument to inspect and filter the content, and then the built-in DOMDocument::saveXML to produce the new sanitised HTML.
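As a rough illustration of the HTML Purifier route (the allowed-element list below is purely an example, not something this answer prescribes):

// Whitelist only the elements and attributes you actually need.
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,br,strong,em,ul,ol,li,a[href],img[src|alt]');

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($dirtyHtml);   // $dirtyHtml is the untrusted input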
I'm building a WYSIWYG editor with HTML5 and JavaScript.
I'll allow users to post raw HTML via the WYSIWYG editor, so it has to be sanitized.
A basic task like protecting the site from cross-site scripting (XSS) is becoming difficult, because there is no up-to-date purify-and-filter software for PHP.
HTML Purifier doesn't support HTML5 at the moment, and the overall status looks bad (HTML5 support isn't coming anytime soon).
So how should I sanitize untrusted HTML5 with PHP (backend)?
Options so far...
HTML Purifier (lacks the new HTML5 tags, data-* attributes, etc.)
Implementing own purifier with strip_tags() and Tidy or PHP's DOM classes/functions
Using some "random" Tidy implementations like http://eksith.wordpress.com/2013/11/23/whitelist-html-sanitizing-with-php/
Google Caja (Javascript / Cloud)
htmLawed (there's a beta with HTML5 support)
Are there any other options out there? Is PHP dying? ;)
PHP offers methods to protect against PHP/SQL code injection (e.g. mysql_real_escape_string()). This is not the case for HTML/CSS/JavaScript. Why is that?
First: HTML/CSS/JavaScript's sole purpose is to display information. It is pretty much up to you to accept or reject certain HTML elements depending on your requirements.
Second: due to the very high and constantly growing number of HTML/CSS/JS elements, it is impossible to control HTML comprehensively; you cannot expect a functional solution.
This is why I would suggest a top-down solution: start by restricting everything, then allow only a certain number of tags. One good base is probably BBCode, which is pretty popular. If you want to "unlock" specific additional tags beyond BBCode, you can always add some (see the sketch below).
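A minimal sketch of that idea, assuming only bold, italics and links are unlocked (the tag list and function name are illustrative, not part of the original answer):

function bbcode_to_html($text) {
    // Escape everything first, then re-enable only the whitelisted BBCode tags.
    $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    $rules = array(
        '#\[b\](.*?)\[/b\]#is' => '<strong>$1</strong>',
        '#\[i\](.*?)\[/i\]#is' => '<em>$1</em>',
        '#\[url=(https?://[^\]]+)\](.*?)\[/url\]#is' => '<a href="$1">$2</a>',
    );
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

echo bbcode_to_html('[b]bold[/b] and [url=https://example.com]a link[/url]');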
This is the reason BBCode-like scripts are popular on forums and websites (including Stack Overflow). WYSIWYG editors are designed for admin/internal use, because you don't expect your website administrator to inject bad content.
Bottom-up approaches are doomed to fail. HTML sanitizers face ever-growing complexity and do not guarantee anything.
EDIT 1
You say it is a sanitization problem, not a front-end issue. I disagree: since you cannot handle all present and future HTML elements, you are better off restricting input at the front-end level to be 100% sure.
That said, perhaps the below is a working solution for you:
You can do a bit to sanitize your code by stripping all tags except those in a whitelist using PHP's strip_tags(). You can also remove all remaining tag attributes (properties) by using PHP's preg_replace() with a regular expression.
$string = "put some very dirty HTML here.";
$string = strip_tags($string, '<p><a><span><h1><li><ul><br>');
$string = preg_replace("/<([b-z][b-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $string);
echo $string;
This will return your sanitized text.
Note: I have excluded attribute removal for <a> tags because you may still want to keep href="" properties; hence the [b-z] range in the regex.
I believe the ideal is to use a combination:
mysql_real_escape_string(addslashes($_REQUEST['data']));
on write
and
stripslashes($data)
on read has always done the trick for me. I think it is better than
htmlentities($data) on write
and
html_entity_decode($data) on read
All,
I am building a small site using PHP. On the site, I receive user-generated text content. I want to allow some safe HTML tags (e.g., for formatting) as well as MathML. How do I go about compiling a whitelist for the strip_tags() function? Is there a well-accepted whitelist I can use?
The standard strip_tags function is not enough for security, since it doesn't validate attributes at all. Use a more complete library built explicitly for sanitizing HTML, like HTML Purifier.
If your aim is to not allow javascript through, then your whitelist of tags is going to be pretty close to the empty set.
Remember that pretty much all tags can have event attributes that contain javascript code to be executed when the specified event occurs.
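A quick illustration of that point, using a hypothetical payload:

// strip_tags() keeps whitelisted tags but never touches their attributes,
// so event-handler XSS survives the "sanitization" untouched.
$dirty = '<img src="x" onerror="alert(document.cookie)">';
echo strip_tags($dirty, '<img>');
// prints: <img src="x" onerror="alert(document.cookie)">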
If you don't want to go down the HTML Purifier kind of route, consider a different language, such as Markdown (which this site uses) or some other wiki-like markup language; however, be sure to disable any use of passthrough HTML that may be allowed.
I want to accept HTML input from users and post it on my site, and I also want to make sure that it doesn't create problems with my site template due to dirty HTML code.
I was using HTML Purifier in the past, but HTML Purifier is not working on one of my servers, so I am searching for the best alternative
that is written purely in PHP
and that can fix dirty HTML code like
</div> (dirty code, since the div is closed without ever being opened).
Simple solution without third-party libraries: create a DOMDocument and call loadHTML on it with your input. Surround the input with <html> and <body> tags if you are only parsing a little snippet. You'll probably want to suppress warnings too, as you'll get them spat out for common bad HTML.
Then simply walk over the resulting document tree, removing any elements and attributes you've not included in a known-good list. You should also check allowed URL attributes to ensure they use known-good schemes like http:, and not potentially troublesome schemes like javascript:. If you want to go the extra mile, you can check that only allowed combinations of elements are nested inside each other (this gets easier the fewer elements you're allowing).
Finally, serialise the snippet's node again using saveHTML. Because you're creating new markup from a DOM, not maintaining the original—potentially malformed—markup, that's a whole class of odd-markup injection techniques you're blocking.
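A rough sketch of that approach; the whitelists below are illustrative assumptions, and a real implementation would need more care (encodings, protocol-relative URLs, and so on):

libxml_use_internal_errors(true);   // silence warnings for the usual bad HTML

$allowedTags = array('p', 'a', 'strong', 'em', 'ul', 'ol', 'li', 'br');
$allowedAttrs = array('a' => array('href'));
$allowedSchemes = array('http', 'https', 'mailto');

$doc = new DOMDocument();
$doc->loadHTML('<html><body>' . $snippet . '</body></html>');   // $snippet is the untrusted fragment
$body = $doc->getElementsByTagName('body')->item(0);

$xpath = new DOMXPath($doc);
// Materialise the node list first so removals don't disturb the iteration.
foreach (iterator_to_array($xpath->query('//body//*')) as $node) {
    $name = strtolower($node->nodeName);
    if (!in_array($name, $allowedTags, true)) {
        $node->parentNode->removeChild($node);   // drop unknown elements entirely
        continue;
    }
    $keep = isset($allowedAttrs[$name]) ? $allowedAttrs[$name] : array();
    foreach (iterator_to_array($node->attributes) as $attr) {
        if (!in_array(strtolower($attr->name), $keep, true)) {
            $node->removeAttribute($attr->name);
        } elseif (strtolower($attr->name) === 'href') {
            $scheme = strtolower((string) parse_url($attr->value, PHP_URL_SCHEME));
            if ($scheme !== '' && !in_array($scheme, $allowedSchemes, true)) {
                $node->removeAttribute($attr->name);   // e.g. javascript: pseudo-URLs
            }
        }
    }
}

// Serialise the cleaned snippet from the DOM, not from the original markup.
$clean = '';
foreach ($body->childNodes as $child) {
    $clean .= $doc->saveHTML($child);
}
echo $clean;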
You can try PHP Tidy, which is the Tidy library in PHP.
I believe Tidy will help close your tags, but it isn't as comprehensive as HTML Purifier, which can remove valid but unwanted tags or attributes (e.g. JavaScript onclick events, that kind of thing).
Be aware that Tidy requires libtidy to be installed on your server, so it's not just straight PHP.
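For what it's worth, a small repair-only sketch using the Tidy extension (the config keys are standard Tidy options; this fixes markup, it does not filter it):

$dirty = '<p>unclosed paragraph <strong>and unclosed bold';

$tidy = new tidy();
$tidy->parseString($dirty, array(
    'show-body-only' => true,   // return just the fragment, not a whole document
    'output-xhtml' => true,
), 'utf8');
$tidy->cleanRepair();           // closes open tags and fixes nesting

echo $tidy;                     // repaired markup; remember this is NOT an XSS filter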
I know Pádraic Brady has been working on an alternative to HTML Purifier for Zend Framework, though I think it's just experimental code at this time:
http://framework.zend.com/wiki/pages/viewpage.action?pageId=25002168
http://github.com/padraic/wibble
Do also consider htmLawed at https://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/
From that page:
use to filter, secure & sanitize HTML in blog comments or forum posts, generate XML-compatible feed items from web-page excerpts, convert HTML to XHTML, pretty-print HTML, scrape web-pages, reduce spam, remove XSS code, etc.
Note that Tidy/HTML Tidy is NOT an anti-XSS solution. It is a clean-and-repair utility that allows you to clean up HTML, XHTML, and XML markup.
htmLawed is a single 55 kB PHP file, whilst HTML Purifier is a 3 MB folder.
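A hedged usage sketch, assuming the htmLawed.php file has been dropped into the project (the config values below are illustrative, not a recommended policy):

require 'htmLawed.php';

$dirty = '<a href="javascript:alert(1)" onclick="evil()">link</a><script>bad()</script>';

$config = array(
    'safe' => 1,                          // block the usual XSS vectors
    'elements' => 'a, em, strong, p, br', // element whitelist (illustrative)
    'schemes' => 'href: http, https',     // restrict URL schemes for href
);
echo htmLawed($dirty, $config);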
I'm looking for best practices for performing strict (whitelist) validation/filtering of user-submitted HTML.
Main purpose is to filter out XSS and similar nasties that may be entered via web forms. Secondary purpose is to limit breakage of HTML content entered by non-technical users e.g. via WYSIWYG editor that has an HTML view.
I'm considering using HTML Purifier, or rolling my own by using an HTML DOM parser to go through a process like HTML(dirty)->DOM(dirty)->filter->DOM(clean)->HTML(clean).
Can you describe successes with these or any easier strategies that are also effective? Any pitfalls to watch out for?
I've tested all exploits I know on HTML Purifier and it did very well. It filters not only HTML, but also CSS and URLs.
Once you narrow elements and attributes to innocent ones, the pitfalls are in attribute content – javascript: pseudo-URLs (IE allows tab characters in protocol name - java script: still works) and CSS properties that trigger JS.
Parsing of URLs may be tricky, e.g. these are valid: http://spoof.com:xxx#evil.com or //evil.com.
Internationalized domains (IDN) can be written in two ways – Unicode and punycode.
Go with HTML Purifier – it has most of these worked out. If you just want to fix broken HTML, then use HTML Tidy (it's available as PHP extension).
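If you do roll your own URL check, here is a sketch of the kind of normalisation involved (the helper name and scheme list are assumptions, and it is deliberately conservative):

function url_is_allowed($url) {
    // Strip control characters and whitespace that browsers tolerate
    // inside the scheme, so "java\tscript:" is treated as "javascript:".
    $normalized = preg_replace('/[\x00-\x20]+/', '', $url);

    $parts = parse_url($normalized);
    if ($parts === false) {
        return false;                       // seriously malformed, reject outright
    }
    if (isset($parts['scheme'])) {
        return in_array(strtolower($parts['scheme']), array('http', 'https', 'mailto'), true);
    }
    // No scheme at all: reject protocol-relative //evil.com, allow plain local paths.
    return substr($normalized, 0, 2) !== '//';
}

var_dump(url_is_allowed('http://example.com/page'));   // true
var_dump(url_is_allowed("java\tscript:alert(1)"));     // false
var_dump(url_is_allowed('//evil.com'));                // false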
User-submitted HTML isn't always valid, or indeed complete. Browsers will interpret a wide range of invalid HTML and you should make sure you can catch it.
Also be aware of the valid-looking:
<img src="http://www.mysite.com/logout" />
and
<a href="http://www.mysite.com/logout">click</a>
I used HTML Purifier with success and haven't had any XSS or other unwanted input filter through. I also run the sanitized HTML through the Tidy extension to make sure it validates as well.
The W3C has a big open-source package for validating HTML available here:
http://validator.w3.org/
You can download the package for yourself and probably implement whatever they're doing. Unfortunately, a lot of DOM parsers are willing to bend the rules to allow for HTML code "in the wild", as it were, so it's a good idea to let the masters tell you what's wrong rather than leave it to a more practical tool; there are a lot of websites out there that aren't perfect, compliant HTML but that we still use every day.