Selectively encoding HTML, how? - php

Allow me to explain my problem by before and after...
I have a comment system on a web community. Users can type in anything they want in a textarea, including special characters and HTML tags. In MySQL, I store the comment body exactly as typed, without any intervention. However, upon display I use HTML entities to prevent users from messing with HTML:
<?= nl2br(htmlentities($comment['body'], ENT_QUOTES, 'UTF-8')) ?>
This is working fine. However, I am now trying to enrich the comment system by automatically converting some links that are placed inside comments into richer objects. This concerns a photo forum and sometimes users make references to other photos by pasting in URLs in the comments:
http://www.jungledragon.com/image/12/eagle.html
Using regular expressions, I am replacing valid links like the one above with markup. In this case, it would be replaced with an img tag so that instead of a link, users see a thumbnail of that image directly inline in the comment.
The replacement is working fine. However, since I am using htmlentities, the replacement markup will render as text, rather than a rendered image. No surprises here.
My question is, how can I selectively HTML-encode a comment body? I want these link replacements to not be escaped, but everything else should be escaped.

Do the htmlentities first and the replacing afterwards.
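A minimal sketch of that order of operations, assuming the photo URLs follow the pattern shown in the question. The function name render_comment and the thumbnail path /thumb/{id}.jpg are made up for illustration; substitute whatever the site actually serves.

```php
<?php
// Sketch: escape the whole comment first, then turn known photo URLs into
// <img> tags. The thumbnail URL scheme below is a hypothetical placeholder.
function render_comment(string $body): string
{
    // 1. Escape everything the user typed.
    $safe = htmlentities($body, ENT_QUOTES, 'UTF-8');

    // 2. Replace only URLs that match a strict pattern. The captured parts
    //    are limited to digits and [a-z0-9-], so they cannot carry markup.
    $pattern = '~http://www\.jungledragon\.com/image/(\d+)/([a-z0-9-]+)\.html~i';
    $safe = preg_replace_callback($pattern, function (array $m): string {
        return '<img src="http://www.jungledragon.com/thumb/' . $m[1]
            . '.jpg" alt="' . $m[2] . '">';
    }, $safe);

    // 3. Convert newlines last, as in the original display code.
    return nl2br($safe);
}
```

Because the replacement runs on already-escaped text and only inserts markup built from tightly constrained captures, everything else the user typed stays escaped.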

Usually, you'd use a library to sanitize the HTML instead. A few are listed here:
http://htmlpurifier.org/comparison

Related

Use WYSIWYG Editor with PHP escape Method

I am building a small test CMS using PHP and MySQL.
Everything is working amazingly on the adding, editing, deleting and displaying level, but after finishing my code, I wanted to add a WYSIWYG editor in the Admin back end.
My problem is that I use an escape method to make my output a bit more secure and to guard against injection. As a result, when I add styled text, an image, or any other HTML in my editor, it is printed as literal code on my page (which is exactly what the escaping is meant to do to prevent attacks).
MY ESCAPE METHOD:
function e($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
Is there any way to work around my escape method (which I think should not be done, because if I can do it, every attacker could)?
Or should I change my escape method to another method?
If I understand you correctly, you want to let your users put some formatting into the text they create, and for this you are going to add a WYSIWYG editor. The question is how to distinguish the formatting and special characters that are allowed from what is not allowed. You need to clean up the text, keeping only the allowed formatting (HTML tags) and removing all malicious JavaScript or HTML.
This is not as easy a task as it might sound at first. I can see several approaches here.
The easiest solution is to use strip_tags and specify which tags are allowed.
But please keep in mind that strip_tags is not perfect. Let me quote the manual here.
Because strip_tags() does not actually validate the HTML, partial or
broken tags can result in the removal of more text/data than expected.
This function does not modify any attributes on the tags that you
allow using allowable_tags, including the style and onmouseover
attributes that a mischievous user may abuse when posting text that
will be shown to other users.
This is a known issue, and libraries exist that do a better job of cleaning up HTML and JS.
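To make the manual's warning concrete, here is a small sketch (the input string is invented for illustration) showing that an allowed tag keeps its attributes, and that the text inside a stripped tag is left behind:

```php
<?php
// Invented input: an allowed <p> carrying an event handler, plus a <script>.
$input = '<p onmouseover="alert(1)">Hello</p><script>alert(2)</script>';

// Only <p> is allowed. strip_tags() removes the <script> tags themselves,
// but it neither removes the onmouseover attribute nor the script's text.
$result = strip_tags($input, '<p>');

// $result is now: <p onmouseover="alert(1)">Hello</p>alert(2)
```

So strip_tags() with an allow-list is only safe if the allowed tags cannot carry dangerous attributes, which it does not check.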
A somewhat more involved solution is to use a dedicated library to clean up the HTML, for example HTML Purifier.
Quote from the documentation
HTML Purifier will not only remove all malicious code (better known as
XSS) with a thoroughly audited, secure yet permissive whitelist, it
will also make sure your documents are standards compliant, something
only achievable with a comprehensive knowledge of W3C's
specifications.
Other libraries exist that solve the same task. See, for example, this article, where such libraries are compared, and pick the one that suits you best.
A completely different approach is to keep users from writing HTML tags at all. Ask them to write some other markup instead, as is done on Stack Overflow, Basecamp or GitHub. Markdown might be a good approach.
Using simple markup for text lets you completely avoid issues with broken HTML and JavaScript, because you can escape everything and build the HTML markup on your own.
The editor might look like the one I'm using to write this message :)
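As a rough illustration of "escape everything and build the HTML on your own", the sketch below handles only **bold** and *italic*. The function name render_light_markup is hypothetical, and a real project would more likely use an existing Markdown library than hand-rolled regexes.

```php
<?php
// Sketch: a tiny Markdown-like subset. All user input is escaped first,
// so the only tags in the output are the ones this function emits.
function render_light_markup(string $text): string
{
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Bold first, so that ** is not consumed by the single-asterisk rule.
    $safe = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $safe);
    $safe = preg_replace('/\*(.+?)\*/s', '<em>$1</em>', $safe);

    return nl2br($safe);
}
```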
You can use strip_tags() to remove the unwanted tags. Read about it on this manual:
http://php.net/manual/en/function.strip-tags.php
Example 1 (Based on the manual)
<?php
$text = '<p>Test paragraph, <a href="#fragment">With link</a>.</p>';
# Output: Test paragraph, With link. (Tags are stripped)
echo strip_tags($text);
echo "\n";
# Allow <p> and <a>
# Output: <p>Test paragraph, <a href="#fragment">With link</a>.</p>
echo strip_tags($text, '<p><a>');
?>
I hope this will help you!

HTML escape only some characters and combination of characters?

I'm managing a blog where a select few people can submit their own articles and entries. I want them to be able to embed video via HTML (and bold, italicize, etc text at their choosing). How do I do this while maintaining site security?
If I don't HTML escape the actual article space, an open comment will ruin my site. Is there a way to selectively escape some combination of characters?
edit; hopefully without writing my own parser. I just want simple things like <b>, <i>, etc tags unescaped, as well as video and link embedding.
I use what SO uses. It is open source and has parsers for many languages.
The name is WMD and the question "Where's the WMD editor open source project?" has some QA material outlining this editor.
The question "running showdown.js serverside to conver Markdown to HTML (in PHP)" has some QA material outlining some Markdown libraries in PHP.
You can safely HTML-escape everything. URLs for your videos will be unaffected by whatever escaping you want to do.
The simplest way to do this that most sites (such as SO) use is to introduce your own special markup, which is then translated into the features that you want.
For example, SO uses single asterisks (*) to italicize and double asterisks (**) to bold (Edit: in addition to the HTML tags <b></b> themselves, see the source of this answer).
Other sites use [b] and [i] tags. You could have a [video=http://myvideo.com] tag, which your PHP then translates into the appropriate HTML markup.
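A minimal sketch of such a translation, assuming [b], [i] and [video=URL] are the only tags supported. The function name render_bbcode and the choice of a <video> element as output are illustrative assumptions; an embed from a hosting site would need different markup.

```php
<?php
// Sketch: escape all input, then translate a small set of custom tags.
function render_bbcode(string $text): string
{
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    $safe = preg_replace('~\[b\](.*?)\[/b\]~is', '<b>$1</b>', $safe);
    $safe = preg_replace('~\[i\](.*?)\[/i\]~is', '<i>$1</i>', $safe);

    // Accept only http(s) URLs without whitespace, brackets or quotes,
    // so schemes such as javascript: never reach the src attribute.
    $safe = preg_replace(
        '~\[video=(https?://[^\s\]"\'<>]+)\]~i',
        '<video src="$1" controls></video>',
        $safe
    );

    return $safe;
}
```

Any real HTML the user types, such as <i>, stays escaped; only the bracketed tags turn into markup.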

PHP: How can I disallow HTML content in user-generated content?

I run a niche social network site. I would like to disallow HTML content, such as embedded videos, in user-posted messages. What options are there in PHP to clean this up before I insert it into the database?
There are three basic solutions:
1. Strip all HTML tags from the post. In PHP you can do this using the strip_tags() function.
2. Encode all the characters, so that if a user types <b>hello</b> it shows up as &lt;b&gt;hello&lt;/b&gt; in the HTML source, or <b>hello</b> on the page itself. In PHP this is the htmlspecialchars() function. (Note: in this situation you would generally store the content in the database as-is, and use htmlspecialchars wherever you output the content.)
3. Use an HTML sanitizer such as HTML Purifier. This allows users to use certain HTML formatting such as bold/italic, but blocks malicious JavaScript and any other tags you wish (i.e. <object> in your case). You may or may not wish to do this before storing in the database, but you must always do it before output in either case.
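For comparison, the first two options applied to the same input look like this (a small sketch; the variable names are arbitrary):

```php
<?php
$post = '<b>hello</b>';

// Option 1: remove the tags entirely.
$stripped = strip_tags($post);                            // hello

// Option 2: keep the text as typed, but encode it for output.
$encoded = htmlspecialchars($post, ENT_QUOTES, 'UTF-8');  // &lt;b&gt;hello&lt;/b&gt;
```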
You could use the strip_tags() function.

Is it possible to execute some tags within a textarea as leaving the rest as plaintext?

I am developing a PHP-based web application in which there is a text area where the user can type whatever they want, and the content is later displayed on another page after being stored in a database. The user can type in HTML tags. However, I wish to let the browser render some tags, such as <a> and <div>, while displaying all other tags as plain text.
I had previously pasted this question:
Prevent HTML data from being posted into form textboxes
But the answers there covered only strip_tags() and htmlspecialchars(), which respectively strip the HTML completely and display the remaining plain text, or display everything as plain text, with no option to exempt particular tags. Please help. Cheers.
You can look at HTML Purifier. This is a library specially designed for this.
It seems it can handle any form of XSS attack. See also the comparison page.
As said in the last post, strip_tags() is the answer. If you read the manual page for strip_tags(), you will see you can tell it which tags to allow, which is exactly what you want.
Check the documentation for strip_tags and you'll see that the second (optional) argument is a string of allowable tags (since PHP 7.4 an array is also accepted).
Edit: Misunderstood it. Never mind D: More sleep is needed methinks. Looks like you should just run a htmlspecialchars and reconvert the required tags back with a regex
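A sketch of that idea, assuming only attribute-free tags need to be restored. The helper name escape_except is hypothetical. Tags with attributes, such as <a href="...">, stay escaped under this scheme; supporting them safely would need attribute validation, which is where a sanitizer library is the better fit.

```php
<?php
// Sketch: escape everything, then convert back only bare allowed tags
// like <b> and </b>. Anything with attributes remains escaped.
function escape_except(string $html, array $allowedTags): string
{
    $safe = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');

    foreach ($allowedTags as $tag) {
        $name = preg_quote($tag, '~');
        // Matches the escaped forms &lt;tag&gt; and &lt;/tag&gt;.
        $safe = preg_replace(
            '~&lt;(/?)' . $name . '&gt;~i',
            '<${1}' . $tag . '>',
            $safe
        );
    }

    return $safe;
}
```

For example, escape_except($input, ['b', 'i']) would leave <b> and <i> working while <script> is shown as text.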
Get PHP's translation table, add the tags you want to keep as entries that map to themselves, then call strtr():
$table = get_html_translation_table(HTML_SPECIALCHARS);
// strtr() tries the longest keys first, so a whole allowed tag wins over the single '<'.
$table['<b>'] = '<b>';
$table['</b>'] = '</b>';
echo strtr($str, $table);
I haven't tested but it should work.

Is it possible to only allow img tags in html comment post?

I have a comment form on my website which, at the moment, filters out all HTML and turns it into plain text, and also replaces bad words with funny words. I want to allow users to post images. I couldn't see how to incorporate this into the comment page, so I have set it up on a separate page dedicated to users posting images. But I still don't want to allow any HTML other than img. I also want to protect against SQL injection.
Does anyone have any ideas?
Thanks.
Two decent methods would be using Tidy or HTMLPurifier. Both filter HTML very well and are highly customizable to suit your needs.
With HTML Purifier (I speak from experience, as I have used it) you can add something like:
img[src|alt|title]
to the allowed tags setting (HTML.Allowed), which allows the img tag with only those attributes. See the website for more information and usage.
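A minimal sketch of how that setting might be wired up, assuming HTML Purifier was installed with Composer (package ezyang/htmlpurifier). The autoload path and the function name purify_comment are assumptions; note that the HTML.Allowed directive separates attributes with a pipe.

```php
<?php
// Sketch: allow only <img> with src, alt and title; everything else is removed.
function purify_comment(string $dirty): string
{
    // Assumed location of Composer's autoloader; adjust to your project.
    require_once __DIR__ . '/vendor/autoload.php';

    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Allowed', 'img[src|alt|title]');

    // Creating the purifier on every call keeps the sketch short; in real
    // code you would build it once and reuse it.
    $purifier = new HTMLPurifier($config);

    return $purifier->purify($dirty);
}
```

Unlike strip_tags(), this also drops attributes that are not on the list, such as onerror.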
Yes, you can pass a list of allowable tags to PHP's strip_tags() function:
$clean_text = strip_tags($html_text, "<img>") ;
