PHP HTML Purifier and MathML


Is there any simple way to allow all MathML tags with attributes in HTML Purifier?
I tried adding all the MathML tags from https://developer.mozilla.org/en-US/docs/Web/MathML/Element/semantics, with their attributes, to HTML.Allowed, but I don't know if this is the right way.

There's currently no native support for MathML in HTML Purifier. There's an old pull request you could potentially repurpose here, but as it's a few years old, patching it in will almost surely require significant manual effort; see also some discussion here:
The primary consideration is security. When adding a very big new
extension like MathML, it is very tempting to cut corners, and not
truly understand every corner of the specification and build a parser
that truly understands what it reads, and isn't just checking
syntax blindly.
Alternatively you could use the customization guide to add them as new tags and attributes to HTML Purifier, but that's more work, not less.
Simply adding the tags to HTML.Allowed won't do much - HTML Purifier's strength is that it understands the context that tags appear in, where they're allowed to appear and what restrictions make sense on their attributes (e.g. an attribute like 'width' takes integers, but an attribute like 'style' takes CSS (that will be sanitised separately), and an attribute like 'onclick' is unsafe by definition). If HTML Purifier doesn't know anything about a particular tag, it won't allow it, even if you add it to the allowlist, because it won't know how to actually handle the tag.
In short:
No, there is unfortunately no simple way to allow MathML in HTML Purifier.

Related

Sanitize HTML5 with PHP (prevent XSS)

I'm building a WYSIWYG editor with HTML5 and JavaScript.
I'll allow users to post pure HTML via the WYSIWYG editor, so it has to be sanitized.
A basic task like protecting the site from cross-site scripting (XSS) is becoming difficult, because there isn't any up-to-date purifying and filtering software for PHP.
HTML Purifier doesn't support HTML5 at the moment and its overall status looks very bad (HTML5 support isn't coming anytime soon).
So how should I sanitize untrusted HTML5 with PHP (backend)?
Options so far...
HTML Purifier (lack of new HTML5 tags, data-attributes etc.)
Implementing my own purifier with strip_tags() and Tidy or PHP's DOM classes/functions
Using some "random" Tidy implementations like http://eksith.wordpress.com/2013/11/23/whitelist-html-sanitizing-with-php/
Google Caja (Javascript / Cloud)
htmLawed (there's beta for HTML5 support)
Are there any other options out there? Is PHP dying? ;)

PHP offers functions that protect against PHP/SQL code injection (e.g. mysql_real_escape_string()). This is not the case for HTML/CSS/JavaScript. Why is that?
First: the sole purpose of HTML/CSS/JavaScript is to display information. It is pretty much up to you to accept or reject certain HTML elements depending on your requirements.
Second: because the number of HTML/CSS/JS elements is very high (and constantly increasing), it is impossible to control all of HTML; you cannot expect a fully functional solution.
This is why I would suggest a top-down solution: start by restricting everything and then allow only a certain number of tags. A good base is probably BBCode, which is pretty popular. If you want to "unlock" specific tags beyond BBCode, you can always add some.
This is the reason BBCode-like scripts are popular on forums and websites (including Stack Overflow). WYSIWYG editors are designed for admin/internal use, because you don't expect your website administrator to inject bad content.
Bottom-up approaches are bound to fail. HTML sanitizers are exposed to exponential complexity and do not guarantee anything.
EDIT 1
You say it is a sanitization problem, not a front-end issue. I disagree: because you cannot handle all present and future HTML elements, you had better restrict it at the front-end level to be 100% sure.
This said, perhaps the below is a working solution for you:
You can go some way toward sanitizing your code by stripping all tags
except those in a whitelist using PHP's strip_tags().
You can also remove all attributes (properties) from the remaining tags
by using PHP's preg_replace() with a regular expression.
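// Untrusted input to be cleaned.
$string = "put some very dirty HTML here.";
// Drop every tag that is not on the whitelist.
$string = strip_tags($string, '<p><a><span><h1><li><ul><br>');
// Strip attributes from the remaining tags, except tags whose name starts with "a".
$string = preg_replace("/<([b-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2>', $string);
echo $string;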
This will return your sanitized text.
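Note: attributes are not removed from <a> tags, because you may still want to keep href="" properties; hence the leading [b-z] in the regex. This also leaves every other attribute on <a> untouched, including event handlers and javascript: URLs, so links still need separate validation.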

I believe the ideal is to use a combination:
mysql_real_escape_string(addslashes($_REQUEST['data']));
On Write
and
stripslashes($data)
on read always did the trick for me, I think it is better than
htmlentities($data) on write
and
html_entity_decode($data) on read

White list of HTML tags I should allow from user generated content?

All,
I am building a small site using PHP. In the site, I receive user-generated text content. I want to allow some safe HTML tags (e.g., formatting) as well as MathML. How do I go about compiling a white list for a strip_tags() function? Is there a well-accepted white list I can use?

The standard strip_tags function is not enough for security, since it doesn't validate attributes at all. Use a more complete library explicitly for the purpose of completely sanitizing HTML like HTML Purifier.
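To make the point about attributes concrete, here is a minimal sketch. The input string and the allowed-tag list are made up for illustration; strip_tags() itself is the standard PHP function.

```php
<?php
// Hypothetical user input: an allowed tag carrying an event handler,
// plus a disallowed <script> element.
$dirty = '<p onclick="alert(1)">Hello <script>alert(2)</script><b>world</b></p>';

// Only <p> and <b> are allowed through.
$clean = strip_tags($dirty, '<p><b>');

// The <script> tags are gone (their text content is kept), but the
// onclick attribute on <p> survives untouched.
echo $clean, PHP_EOL;
```

The tag filter works as advertised, yet the result is still unsafe to output, which is why a library that validates attributes is the better fit.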

If your aim is to not allow javascript through, then your whitelist of tags is going to be pretty close to the empty set.
Remember that pretty much all tags can have event attributes that contain javascript code to be executed when the specified event occurs.
If you don't want to go down the HTMLPurifier kind of route, consider a different language, such as markdown (that this site uses) or some other wiki-like markup language; however, be sure to disable any use of passthrough HTML that may be allowed.

XSS - Which HTML Tags and Attributes can trigger Javascript Events?

I'm trying to code a secure and lightweight white-list based HTML purifier which will use DOMDocument. In order to avoid unnecessary complexity I am willing to make the following compromises:
HTML comments are removed
script and style tags are stripped altogether
only the child nodes of the body tag will be returned
all HTML attributes that can trigger Javascript events will either be validated or removed
I've been reading a lot about XSS attacks and prevention and I hope I'm not being too naive (if I am, please let me know!) in assuming that if I follow all the rules I mentioned above, I will be safe from XSS.
The problem is I am not sure what other tags and attributes (in any [X]HTML version and/or browser versions/implementations) can trigger Javascript events, besides the default Javascript event attributes:
onAbort
onBlur
onChange
onClick
onDblClick
onDragDrop
onError
onFocus
onKeyDown
onKeyPress
onKeyUp
onLoad
onMouseDown
onMouseMove
onMouseOut
onMouseOver
onMouseUp
onMove
onReset
onResize
onSelect
onSubmit
onUnload
Are there any other non-default or proprietary event attributes that can trigger Javascript (or VBScript, etc...) events or code execution? I can think of href, style and action, for instance:
<a href="javascript:alert(document.location);">XSS</a> // or
<b style="width: expression(alert(document.location));">XSS</b> // or
<form action="javascript:alert(document.location);"><input type="submit" /></form>
I will probably just remove any style attributes in the HTML tags. The action and href attributes pose a bigger challenge, but I think the following code is enough to make sure their value is either a relative or absolute URL and not some nasty JavaScript code:
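$value = $attribute->value;
// A value containing a colon must start with an allowed scheme; otherwise the attribute is dropped.
if ((strpos($value, ':') !== false) && (preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) == 0))
{
    $node->removeAttributeNode($attribute);
}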
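The check above can be tried in isolation. The sketch below wraps the same two conditions in a helper; the function name hasSafeScheme() and the sample URLs are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper wrapping the scheme check from the snippet above.
function hasSafeScheme(string $value): bool
{
    // No colon at all: treated as a relative URL and kept.
    if (strpos($value, ':') === false) {
        return true;
    }
    // Otherwise the value must start with one of the allowed schemes.
    return preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) === 1;
}

// Sample values, made up for this sketch.
$samples = ['http://example.com/', '/docs/page.html', 'javascript:alert(1)', ' javascript:alert(1)'];
foreach ($samples as $url) {
    echo var_export($url, true), ' => ', hasSafeScheme($url) ? 'keep' : 'remove', PHP_EOL;
}
```

Because the pattern is anchored at the start of the value, anything with a colon that does not begin with an allowed scheme is removed. That errs on the strict side: a relative URL such as page?x=a:b would be removed as well.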
So, my two obvious questions are:
Am I missing any tags or attributes that can trigger events?
Is there any attack vector that is not covered by these rules?
After a lot of testing, pondering and researching I've come up with the following (rather simple) implementation which appears to be immune to any XSS attack vector I could throw at it.
I highly appreciate all your valuable answers, thanks.

You mention href and action as places javascript: URLs can appear, but you're missing the src attribute among a bunch of other URL loading attributes.
Line 399 of the OWASP Java HTMLPolicyBuilder is the definition of URL attributes in a white-listing HTML sanitizer.
private static final Set<String> URL_ATTRIBUTE_NAMES = ImmutableSet.of(
"action", "archive", "background", "cite", "classid", "codebase", "data",
"dsync", "formaction", "href", "icon", "longdesc", "manifest", "poster",
"profile", "src", "usemap");
The HTML5 Index contains a summary of attribute types. It doesn't mention some conditional things like <input type=URL value=...> but if you scan that list for valid URL and friends, you should get a decent idea of what HTML5 adds. The set of HTML 4 attributes with type %URI is also informative.
Your protocol whitelist looks very similar to the OWASP sanitizer one. The addition of ftp and sftp looks innocuous enough.
A good source of security related schema info for HTML element and attributes is the Caja JSON whitelists which are used by the Caja JS HTML sanitizer.
How are you planning on rendering the resulting DOM? If you're not careful, then even if you strip out all the <script> elements, an attacker might get a buggy renderer to produce content that a browser interprets as containing a <script> element. Consider the valid HTML that does not contain a script element.
<textarea>&lt;/textarea&gt;&lt;script&gt;alert(1337)&lt;/script&gt;</textarea>
A buggy renderer might output the contents of this as:
<textarea></textarea><script>alert(1337)</script></textarea>
which does contain a script element.
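The difference between a buggy and a careful renderer can be sketched in PHP. This assumes the renderer builds markup by string concatenation; the variable names are hypothetical, and htmlspecialchars() stands in for whatever escaping a real serializer applies to text nodes.

```php
<?php
// The text content of the textarea, as it exists in the DOM after parsing.
$text = '</textarea><script>alert(1337)</script>';

// Buggy renderer: writes the text node back out without escaping it.
$buggy = '<textarea>' . $text . '</textarea>';

// Careful renderer: escapes the text node while serializing.
$safe = '<textarea>' . htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8') . '</textarea>';

echo $buggy, PHP_EOL;
echo $safe, PHP_EOL;
```

Only the first string contains a real script element when a browser parses it.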
(Full disclosure: I wrote chunks of both HTML sanitizers mentioned above.)

Garuda has already given what I would deem as the "correct" answer, and his links are very useful, but he beat me to the punch!
I give my answer only to reinforce.
In this day and age of increasing features in the html and ecmascript specs, avoiding script injection and other such vulnerabilities in html becomes more and more difficult. With each new addition, a whole world of possible injections is introduced. This is coupled with the fact that different browsers probably have different ideas of how they are going to implement these specs, so you get even more possible vulnerabilities.
Take a look at a short list of vectors introduced by html 5
The best solution is to choose what you will allow rather than what you will deny. It is much easier to say "These tags and these attributes for those given tags alone are allowed. Everything else will be sanitized accordingly or thrown out."
It would be very irresponsible for me to compile a list and say "okay, here you go: here's a list of all of the injection vectors you missed. You can sleep easy." In fact, there are probably many injection vectors that are not even known by black hats or white hats. As the ha.ckers website states, script injection is really only limited by the mind.
I'd like to answer your specific question at least a little bit, so here are some glaring omissions from your blacklist:
img src attribute. I think it is important to note that src is a valid attribute on other elements and could be potentially harmful. img also has dynsrc and lowsrc, maybe even more.
type and language attributes
CDATA in addition to just html comments.
Improperly sanitized input values. This may not be a problem depending upon how strict your html parsing is.
Any ambiguous special characters. In my opinion, even unambiguous ones should probably be encoded.
Missing or incorrect quotes on attributes (such as grave quotes).
Premature closing of textarea tags.
UTF-8 (and 7) encoded characters in scripts
Even though you will only return child nodes of the body tag, many browsers will still evaluate head, and html elements inside of body, and most head-only elements inside of body anyway, so this probably won't help much.
In addition to css expressions, background image expressions
frames and iframes
embed and probably object and applet
Server side includes
PHP tags
Any other injections (SQL Injection, executable injection, etc.)
By the way, I'm sure this doesn't matter, but camelCased attributes are invalid xhtml and should be lower cased. I'm sure this doesn't affect you.

You might want to check these 2 links out for additional reference:
http://adamcecc.blogspot.com/2011/01/javascript.html (this is only applicable when your 'filtered' input is ever going to find itself between script tags on a page)
http://ha.ckers.org/xss.html (which has a lot of browser-specific event triggers listed)
I've used HTML Purifier, as you are doing, for this reason too, in combination with a WYSIWYG editor. What I did differently was to use a very strict whitelist with a couple of basic markup tags and attributes available, expanding it when the need arose. This keeps you from getting attacked by very obscure vectors (like the first link above), and you can dig into each newly needed tag/attribute one by one.
Just my 2 cents..

Don't forget the HTML5 JavaScript event handlers
http://www.w3schools.com/html5/html5_ref_eventattributes.asp

Alternative of html purifier

I want to accept HTML input from users and post it on my site, and I also want to make sure that it doesn't break my site template because of dirty HTML code.
I was using HTML Purifier in the past, but HTML Purifier is not working on one of my servers, so I am searching for the best alternative,
one that is written purely in PHP
and that can fix dirty HTML code like
</div>, which is dirty code because the div is closed without being opened.

Simple solution without third-party libraries: create a DOMDocument and call loadHTML on it with your input. Surround the input with <html> and <body> tags if you are only parsing a little snippet. You'll probably want to suppress warnings too, as you'll get them spat out for common bad HTML.
Then simply walk over the resulting document tree, removing any elements and attributes you've not included in a known-good list. You should also check allowed URL attributes to ensure they use known-good schemes like http:, and not potentially troublesome schemes like javascript:. If you want to go the extra mile you can check that only allowed combinations of elements are nested inside each other (this is easier the smaller number of elements you're allowing).
Finally, serialise the snippet's node again using saveHTML. Because you're creating new markup from a DOM, not maintaining the original—potentially malformed—markup, that's a whole class of odd-markup injection techniques you're blocking.
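The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a vetted sanitizer: the ALLOWED list, the function names sanitizeSnippet() and cleanChildren(), and the http/https-only rule for href are all hypothetical choices, and nesting rules are not checked.

```php
<?php
// Hypothetical known-good list: element name => permitted attribute names.
const ALLOWED = [
    'p' => [], 'b' => [], 'i' => [], 'ul' => [], 'li' => [], 'br' => [],
    'a' => ['href'],
];

function sanitizeSnippet(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the snippet; the leading declaration hints UTF-8 to the parser.
    $doc->loadHTML('<?xml encoding="UTF-8"><html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    cleanChildren($body);

    // Serialise only the children of <body>, building fresh markup from the DOM.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

function cleanChildren(DOMNode $parent): void
{
    // Copy the live node list first, because nodes are removed while walking.
    foreach (iterator_to_array($parent->childNodes, false) as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            continue; // plain text is kept and re-escaped on output
        }
        if (!($node instanceof DOMElement) || !array_key_exists($node->nodeName, ALLOWED)) {
            // Comments, CDATA and unknown elements are dropped with their content.
            $parent->removeChild($node);
            continue;
        }
        foreach (iterator_to_array($node->attributes, false) as $attr) {
            $ok = in_array($attr->nodeName, ALLOWED[$node->nodeName], true);
            if ($ok && $attr->nodeName === 'href') {
                // Assumption for this sketch: only absolute http(s) links are allowed.
                $ok = preg_match('~^https?://~i', $attr->value) === 1;
            }
            if (!$ok) {
                $node->removeAttributeNode($attr);
            }
        }
        cleanChildren($node);
    }
}

echo sanitizeSnippet('<p onclick="alert(1)">Hi <a href="javascript:alert(2)">x</a><script>alert(3)</script></p>'), PHP_EOL;
```

Dropping unknown elements together with their content is the stricter of two possible designs; unwrapping them and keeping their children is friendlier to users but needs more care.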

You can try PHP Tidy, which is the Tidy library in PHP.

I believe Tidy will help close your tags, but it isn't as comprehensive as HTML Purifier which can remove valid but unwanted tags or attributes (i.e. JavaScript onclick events, that kind of thing).
Be aware that Tidy requires libtidy to be installed on your server, so it's not just straight PHP.
I know Pádraic Brady has been working on an alternative to HTML Purifier for Zend Framework, though I think it's just experimental code at this time:
http://framework.zend.com/wiki/pages/viewpage.action?pageId=25002168
http://github.com/padraic/wibble

Do also consider htmLawed at https://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/
From that page:
use to filter, secure & sanitize HTML in blog comments or forum posts, generate XML-compatible
feed items from web-page excerpts, convert HTML to XHTML, pretty-print
HTML, scrape web-pages, reduce spam, remove XSS code, etc.
Note that Tidy/HTML Tidy is NOT an anti-XSS solution. It is a clean-and-repair utility which allows you to clean HTML, XHTML, and XML markup.
htmLawed is a 55 KB single PHP file, whilst HTML Purifier is a 3 MB folder.

Strict HTML Validation and Filtering in PHP

I'm looking for best practices for performing strict (whitelist) validation/filtering of user-submitted HTML.
Main purpose is to filter out XSS and similar nasties that may be entered via web forms. Secondary purpose is to limit breakage of HTML content entered by non-technical users e.g. via WYSIWYG editor that has an HTML view.
I'm considering using HTML Purifier, or rolling my own by using an HTML DOM parser to go through a process like HTML(dirty)->DOM(dirty)->filter->DOM(clean)->HTML(clean).
Can you describe successes with these or any easier strategies that are also effective? Any pitfalls to watch out for?

I've tested all exploits I know on HTML Purifier and it did very well. It filters not only HTML, but also CSS and URLs.
Once you narrow elements and attributes to innocent ones, the pitfalls are in attribute content – javascript: pseudo-URLs (IE allows tab characters in protocol name - java script: still works) and CSS properties that trigger JS.
Parsing of URLs may be tricky, e.g. these are valid: http://spoof.com:xxx#evil.com or //evil.com.
Internationalized domains (IDN) can be written in two ways – Unicode and punycode.
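Two of the pitfalls above, control characters inside the scheme and protocol-relative URLs, can be illustrated with a small classifier. This is a rough sketch: the function name classifyUrl(), the set of allowed schemes and the return labels are assumptions made for this example. It only looks at the scheme and does not validate hosts, ports or IDN forms.

```php
<?php
// Hypothetical classifier; returns one of four labels invented for this sketch.
function classifyUrl(string $url): string
{
    // Strip ASCII control characters and whitespace, which some browsers
    // have tolerated inside a scheme name (e.g. a tab in "java script:").
    $normalized = preg_replace('/[\x00-\x20\x7F]+/', '', $url);

    // "//evil.com" inherits the page's scheme but points at another host.
    if (strpos($normalized, '//') === 0) {
        return 'protocol-relative';
    }
    // A leading "name:" is a scheme; compare it against the allowed set.
    if (preg_match('~^([a-z][a-z0-9+.-]*):~i', $normalized, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https', 'mailto'], true)
            ? 'allowed'
            : 'rejected';
    }
    return 'relative';
}

foreach (["java\tscript:alert(1)", '//evil.com', 'https://example.com/', '/a/b?x=1'] as $url) {
    echo var_export($url, true), ' => ', classifyUrl($url), PHP_EOL;
}
```

A URL such as http://spoof.com:xxx#evil.com is still classified as allowed here, since its scheme is http; catching that kind of spoofing requires parsing the authority part as well.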
Go with HTML Purifier – it has most of these worked out. If you just want to fix broken HTML, then use HTML Tidy (it's available as PHP extension).

User-submitted HTML isn't always valid, or indeed complete. Browsers will interpret a wide range of invalid HTML and you should make sure you can catch it.
Also be aware of the valid-looking:
<img src="http://www.mysite.com/logout" />
and
<a href="javascript:alert('xss hole');">click</a>

I used HTML Purifier with success and haven't had any XSS or other unwanted input filter through. I also run the sanitized HTML through the Tidy extension to make sure it validates as well.

The W3C has a big open-source package for validating HTML available here:
http://validator.w3.org/
You can download the package for yourself and probably implement whatever they're doing. Unfortunately, a lot of DOM parsers seem willing to bend the rules to allow for HTML code "in the wild", as it were, so it's a good idea to let the masters tell you what's wrong and not leave it to a more practical tool. There are a lot of websites out there that aren't perfect, compliant HTML but that we still use every day.
