PHP: How can I disallow HTML content in user-generated content? - php

I run a niche social network site. I would like to disallow HTML content, such as embedded videos, in user-posted messages. What options are there in PHP to clean this up before I insert it into the database?

There are three basic solutions:
Strip all HTML tags from the post. In PHP you can do this using the strip_tags() function.
Encode all the characters, so that if a user types <b>hello</b> it shows up as &lt;b&gt;hello&lt;/b&gt; in the HTML source, or <b>hello</b> on the page itself. In PHP this is the htmlspecialchars() function. (Note: in this situation you would generally store the content in the database as-is, and use htmlspecialchars() wherever you output the content.)
Use an HTML sanitizer such as HTML Purifier. This allows users to use certain HTML formatting such as bold/italic, but blocks malicious JavaScript and any other tags you wish (e.g. <object> in your case). You may or may not wish to do this before storing in the database, but you must always do it before output in either case.
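
To make the difference between the first two options concrete, here is a minimal sketch. The sample input and the variable names are made up for illustration; only strip_tags() and htmlspecialchars() come from the answer above.

```php
<?php
// Hypothetical user input, used only for illustration.
$post = '<b>hello</b> <object data="movie.swf"></object>';

// Option 1: remove every tag and keep only the text between the tags.
$stripped = strip_tags($post);

// Option 2: keep the text exactly as typed and escape it at output time,
// so the browser displays the markup instead of interpreting it.
$escaped = htmlspecialchars($post, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n";
echo $escaped, "\n";
```

With the first option the markup is gone for good. With the second, the stored text is unchanged and only the output is escaped.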

You could use the strip_tags() function.

Related

html_entity_decode Terminate?

I'm using html_entity_decode($row['Content']) to display some JSON data that contains HTML in a PHP document. Problem is that some of the data being returned has open HTML tags such as <strong> which then carry on to the content displayed after.
Is there some way to terminate the HTML?
If you ever accept raw HTML from an outside source to embed into your site, you should always, always, reformat and whitelist it. You have no idea what that 3rd party HTML may contain, and you have no guarantee that it's valid; yet on your site you presumably want guaranteed valid HTML with certain limits on its content (or do you really want to enable the embedding of arbitrary <script> tags...?!).
That means you want to:
parse the HTML and extract whatever structural information is in it
filter that structure to allow only approved elements and then
produce your own HTML from that which you can guarantee is syntactically valid.
Supposedly the best PHP library which does that is HTML Purifier. Without using a library, you would use a lenient HTML parser, something like DOMDocument to inspect and filter the content, and then the built-in DOMDocument::saveXML to produce the new sanitised HTML.
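
The three steps above (parse, filter, re-serialise) could be sketched with DOMDocument roughly as follows. The function name rebuildHtml() and the allow list are assumptions made for this example. It drops every attribute, so links would lose their href, and it is not a substitute for a maintained sanitiser such as HTML Purifier.

```php
<?php
// Minimal sketch: parse -> filter -> re-serialise with DOMDocument.
// rebuildHtml() and the default allow list are illustrative assumptions.
function rebuildHtml(string $html, array $allowedTags = ['strong', 'em', 'p', 'br']): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally; the lenient parser repairs bad markup.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common way to hint that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Copy the live node list first; removing nodes while iterating it skips entries.
    foreach (iterator_to_array($body->getElementsByTagName('*')) as $el) {
        if (!in_array($el->nodeName, $allowedTags, true)) {
            // Drop disallowed elements together with their content.
            if ($el->parentNode !== null) {
                $el->parentNode->removeChild($el);
            }
            continue;
        }
        // Strip every attribute (onclick, style, ...) from allowed elements.
        foreach (iterator_to_array($el->attributes, false) as $attr) {
            $el->removeAttribute($attr->nodeName);
        }
    }

    // Serialise only the children of <body>; each element comes out closed.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo rebuildHtml('<strong onclick="x()">bold <em>text<script>alert(1)</script>'), "\n";
```

Because the output is generated from the parsed tree rather than copied from the input, an unclosed <strong> can no longer leak into the rest of the page.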

use <br> tags only body part on Newsletter Email template in PHP [duplicate]

I got a textarea where the user can write an article. The article can contain text (bold and italic), links and youtube videos. How do I allow those certain html tags and still post secure xss-preventing code?
I would use HTML Purifier to ensure that you only keep HTML that:
is valid
and only contains tags and attributes you've chosen to allow
I should add that PHP provides the strip_tags() function, but it's not that good (quoting) :
Because strip_tags() does not actually validate the HTML, partial or
broken tags can result in the removal of more text/data than expected.
This function does not modify any attributes on the tags that
you allow using allowable_tags, including the style and onmouseover
attributes that a mischievous user may abuse when posting text that
will be shown to other users.
If you are looking for real XSS protection, I suggest using HTML Purifier. Doing it yourself is hard, if not impossible, and is bound to leave mistakes (holes) in it.
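
A short sketch of the attribute problem described in the quoted warning. The sample comment is invented for illustration.

```php
<?php
// Hypothetical comment: <b> is on the allow list, but it carries an
// event-handler attribute that runs JavaScript on hover.
$comment = '<b onmouseover="alert(1)">hover me</b> <i>bye</i>';

// Allow only <b>. strip_tags() removes the <i> tags but does not touch
// the attributes of the tags it keeps.
$result = strip_tags($comment, '<b>');

echo $result, "\n";
```

The onmouseover handler is still present in $result, which is why an allow list passed to strip_tags() does not amount to XSS protection.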

White list of HTML tags I should allow from user generated content?

All,
I am building a small site using PHP. On the site, I receive user-generated text content. I want to allow some safe HTML tags (e.g., formatting) as well as MathML. How do I go about compiling a white list for the strip_tags() function? Is there a well-accepted white list I can use?
The standard strip_tags function is not enough for security, since it doesn't validate attributes at all. Use a more complete library explicitly for the purpose of completely sanitizing HTML like HTML Purifier.
If your aim is to not allow javascript through, then your whitelist of tags is going to be pretty close to the empty set.
Remember that pretty much all tags can have event attributes that contain javascript code to be executed when the specified event occurs.
If you don't want to go down the HTMLPurifier kind of route, consider a different language, such as markdown (that this site uses) or some other wiki-like markup language; however, be sure to disable any use of passthrough HTML that may be allowed.

Selectively encoding HTML, how?

Allow me to explain my problem by before and after...
I have a comment system on a web community. Users can type in anything they want in a textarea, including special characters and HTML tags. In MySQL, I store the comment body exactly as typed, without any intervention. However, upon display I use HTML entities to prevent users from messing with HTML:
<?= nl2br(htmlentities($comment['body'], ENT_QUOTES, 'UTF-8')) ?>
This is working fine. However, I am now trying to enrich the comment system by automatically converting some links that are placed inside comments into richer objects. This concerns a photo forum and sometimes users make references to other photos by pasting in URLs in the comments:
http://www.jungledragon.com/image/12/eagle.html
Using regular expressions, I am replacing valid links like the above into markup. In this case, it would be replaced with an img tag so that instead of a link, users see a thumb of that image directly inline in the comment.
The replacement is working fine. However, since I am using htmlentities, the replacement markup will render as text, rather than a rendered image. No surprises here.
My question is, how can I selectively html encode a comment body? I want these links replacements to not be escaped, but everything else should be escaped.
Do the htmlentities() call first and the replacing afterwards.
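
A minimal sketch of that order of operations. The function name renderComment(), the URL pattern and the thumbnail path /thumbs/<id>.jpg are assumptions made for this example, not the site's real scheme.

```php
<?php
// Sketch of "escape first, replace afterwards".
// renderComment(), the pattern and the /thumbs/ path are hypothetical.
function renderComment(string $body): string
{
    // 1. Escape everything the user typed.
    $safe = htmlentities($body, ENT_QUOTES, 'UTF-8');

    // 2. Replace known photo links in the already-escaped text. The captured
    //    groups match only digits and [a-z0-9_-], so they cannot inject markup.
    $pattern = '~http://www\.jungledragon\.com/image/(\d+)/([a-z0-9_-]+)\.html~i';
    $withThumbs = preg_replace($pattern, '<img src="/thumbs/$1.jpg" alt="$2">', $safe) ?? $safe;

    // 3. Convert newlines last, as in the original output code.
    return nl2br($withThumbs);
}

echo renderComment('<b>Nice</b> shot, see http://www.jungledragon.com/image/12/eagle.html'), "\n";
```

The only unescaped markup in the result is the <img> tag produced by the replacement itself, so keeping the pattern's capture groups strict is what keeps this safe.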
Usually, you'd use a library to sanitize the HTML instead. A few are listed here:
http://htmlpurifier.org/comparison

Want to 'sandbox' user form submitted HTML

I have a user form with a textarea that allows users to submit html formatted data. The html itself is limited by PHP strip_tags, but of course that does no completion checking etc.
My basic problem is that should a user leave a tag unclosed, such as the <a> tag, then all the content following that, including page content that follows that is 'outside' the user content display area, could now be malformed.
Checking for proper tag completion is one solution I will look at, but ideally I'd like to firewall the user htmlified content away from the rest of the site somehow.
Use HTML Purifier. Very thorough and easy-to-use standalone plugin. It makes sure all markup is valid XHTML and also prevents XSS attacks.
I would recommend saving two copies of the user's HTML input in your database. One copy would be the raw form that they submitted, which you can use when they edit their page later, and the second would be the version sanitized by HTML Purifier, which you display on output. Storing the sanitized version is much faster than running HTML Purifier on every page load.
The only way to achieve complete isolation would be to use an iframe.
The other solution would be to limit the HTML tags users could employ. Limiting users to paragraph and inline tags (strong, em, a, etc.) would ensure that you could wrap all of the content in a div tag and not have to worry about open tags.
Just use some function for completing unclosed tags.
This can help you:
http://concepts.waetech.com/unclosed_tags/
