Validation for comment forms

I am working on a comment system in CodeIgniter and would appreciate some advice on which validation rules I should employ. I don't want to allow images or any other HTML.
So far I just have trim and max_length set. I also run the content through htmlspecialchars before I insert it in the database. I have XSS filtering enabled globally.
What other precautions should I take? Is htmlspecialchars enough to prevent JavaScript or other malicious code from being entered?

You should probably do regular form validation on required and max_length, and obviously XSS filtering before pushing things to the database. If you allow some tags, htmlspecialchars should only be applied to the text that isn't inside tags, so you can't just run htmlspecialchars over the whole string. You need to:
1 - strip the tag elements (and store them), like "<br/>" or "<b>", but not their content, that is, nothing between the "<b>" and "</b>". You can probably do this with a preg_match.
2 - run htmlentities on all the remaining text
3 - remove all unwanted explicit tags (from the stored bunch of tags)
strip_tags ( string $str [, string $allowable_tags ] )
4 - then filter the allowed tags for attributes and content. It's not uncommon for attackers to use code like
<b onMouseOver="window.open(..)"></b>
To fix this you'll have to do a little bit of extra work, probably with some regexes. If you want me to write some more sample code, let me know.
5 - re-add the tag elements to the document.
I just basically cooked this up right now. The algorithm can be improved in efficiency (i.e. strip the unwanted tags first, and then proceed with filtering html entities and tag contents) but I'll leave that up to you.
These are the potential attacks I can see right now. There might be other ways to abuse your input though, so you might want to check what other comment systems use for their validation, such as the phpBB forum system. Another option is to use the phpBB square-bracket format for tags, so you don't let users input ANY HTML tags whatsoever and instead accept only square-bracket tags that you control.
Does this answer your question?
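
To make the square-bracket idea concrete, here is a minimal sketch, not a complete BBCode parser. The function name renderComment and the choice of [b], [i] and [u] as the allowed tags are assumptions made for illustration. It escapes everything first and only then turns the whitelisted bracket tags into HTML, so anything the user typed as real HTML stays inert text.

```php
<?php
// Hypothetical helper: render a comment that may contain [b], [i] and [u] tags.
function renderComment(string $raw): string
{
    // 1. Escape everything, so real HTML typed by the user becomes plain text.
    $safe = htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // 2. Convert only the whitelisted square-bracket tags into HTML.
    //    These tags take no attributes, so there is nowhere to hide an event handler.
    $safe = preg_replace('#\[(b|i|u)\](.*?)\[/\1\]#is', '<$1>$2</$1>', $safe) ?? $safe;

    // 3. Keep the line breaks the user typed.
    return nl2br($safe);
}

echo renderComment('[b]Hi[/b] <b onMouseOver="window.open()">x</b>'), "\n";
```

The single pass does not handle nested tags such as [b][i]x[/i][/b]; a real parser would need to, which is one reason to borrow an existing one.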

Related

php character limits (trim an html paragraph)

We have our own blog system and the post data is stored as raw HTML, so when it's read from the db we can just echo it and it's fully formatted; no need for BB codes in our situation. Our issue now is that our blog posts are sometimes too long and need to be trimmed.
The problem is that our data contains html, mostly <font>, <span>, <p>, <b>, and other styling tags. I made a php function that trims the characters, but it doesn't take into account the html tags. If the trim function trims the blog it should not trim tags because it messes the whole page. The function needs to be able to close the html tags if they're trimmed. Is there a function out there that can do this? or a function where I could start and build from it?

There's a good example here of truncating text while preserving HTML tags.

There is strip_tags which gets rid of all HTML tags but other than that there isn't much.
This is not an easy thing by the way, you have to actually parse the HTML to find out which tags are left open - that's the most robust approach anyway. Also, don't use a regular expression.

The right solution is to not store display information in your database layer.
Failing that, you could use CSS overflow properties: print the whole post, and then have the display layer handle sizing it to fit. This mitigates the problem of having formatting information in your database by putting the resizing (a display issue, not a content issue) into the display layer as well.
Failing that, you could parse the HTML and "round up" or "round down" to the nearest tag boundary, then insert the tag-close characters necessary to finish the block you were in.
Another option is to iframe the content.
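
For the parse-and-cut approach, here is a rough sketch using PHP's DOM extension. The names truncateHtml and trimNode are made up for this example. It counts only text characters (whitespace included), cuts the text node where the limit is reached, drops everything after it, and lets the serializer close whatever tags are still open.

```php
<?php
// Hypothetical helper: walk the tree, spending $remaining on text characters.
function trimNode(DOMNode $node, int &$remaining): void
{
    // Copy the child list first, because we remove nodes while looping.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);          // past the limit: drop it
        } elseif ($child instanceof DOMText) {
            if ($child->length > $remaining) {
                // Cut this text node at the limit.
                $child->deleteData($remaining, $child->length - $remaining);
                $remaining = 0;
            } else {
                $remaining -= $child->length;
            }
        } else {
            trimNode($child, $remaining);        // element: descend into it
        }
    }
}

// Hypothetical helper: keep at most $limit text characters, tags stay balanced.
function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // stored markup is rarely valid
    // The XML prolog is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $remaining = $limit;
    trimNode($body, $remaining);

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncateHtml('<p>Hello <b>brave new</b> world</p>', 11), "\n";
```

Because the cut happens inside the parsed tree rather than in the string, an open <b> or <p> can never be left dangling.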

I know this isn't the best way to do it programmatically, but have you considered manually specifying where the cut should be? Adding something like and cutting it there manually would allow you to control where the cut happened, regardless of the number of characters before it. For example, you could always put that below the first paragraph.
Admittedly, you lose the ability to just have it happen automatically, but I bring it up in case that doesn't matter as much to you.

Cleaning an HTML string saving some tags and attributes

After I implemented my sanitize functions (according to the requested specifics), my boss decided to change the accepted input. Now he wants to keep some specific tags and their attributes. I suggested implementing a BBCode-like language, which is safer imho, but he doesn't want to because it would be too much work.
This time I would like to keep it simple so I will not kill him the next time he asks me to change again this thing. And I know he will.
Is it enough to use first the strip_tags with the tag parameter to preserve and then htmlentities?

strip_tags does not necessarily result in safe content. strip_tags followed by htmlentities would be safe, in that anything HTML-encoded is safe, but it doesn't make any sense.
Either the user is inputting plain text, in which case it should be output using htmlspecialchars (in preference to htmlentities), or they're inputting HTML markup, in which case you need to parse it properly, fixing broken markup and removing elements/attributes that aren't in a safe whitelist.
If that's what you want, use an existing library to do it (eg. htmlpurifier). Because it's not a trivial task and if you get it wrong you've given yourself XSS security holes.

You can keep specific tags using strip_tags with this syntax: strip_tags($text, '<p><a>');
That snippet would strip all tags except p and a. Attributes are kept for tags you have allowed (p and a in the above example).
However, this doesn't mean that the attributes are safe. Does he want specific attributes, or does he want to keep all of them on allowed tags? In the first case, you would need to parse each tag and remove the attributes that are not wanted, sanitizing the values of the rest. To keep all attributes on allowed tags, you still need to sanitize them. I would recommend running htmlentities on the attribute values to sanitize them (for display, I would assume).
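
A quick way to see the attribute problem for yourself (the input string is just an invented example):

```php
<?php
// Invented input: an allowed tag carrying an event handler, plus a script block.
$input = '<p onclick="steal()">Hello</p><script>alert(1)</script>';

// Keep only <p> and <a>, as in the snippet above.
$kept = strip_tags($input, '<p><a>');

// The <script> tags are gone (their text content is not),
// but the onclick attribute on the allowed <p> is still there.
echo $kept, "\n";
```

So strip_tags with an allow-list decides which tags survive, not which attributes, and it should not be treated as a sanitizer on its own.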

How to prevent XSS attack with Zend Form using %

Our company has made a website for our client. The client hired a web security company to test the pages for security before the product launches.
We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements.
They ran another test and it still failed :(
They used the following for one input field in the data of the HTTP request: name=%3Cscript%3Ealert%28123%29%3C%2Fscript%3E which basically translates to name=<script>alert(123)</script>
I've added alpha and alnum to some of the fields, which fixes the XSS vulnerability (touch wood) by removing the %; however, the boss doesn't like it, because what about O'Brien and double-barrelled surnames?
I haven't come across the %3C as < problem reading up about XSS. Is there something wrong with my html character set or encoding or something?
I probably now have to write a custom filter, but that would be a huge pain to do that with every website and deployment. Please help, this is really frustrating.
EDIT:
If it's about escaping the form's output, how do I do that? The form submits to the same page. How do I escape if all I have in my view is <?= $this->form ?>
How can I get Zend Form to escape its output?

%3Cscript%3Ealert%28123%29%3C%2Fscript%3E is the URL-encoded form of <script>alert(123)</script>. Any time you include < in a form value, it will be submitted to the server as %3C. PHP will read and decode that back to < before anything in your application gets a look at it.
That is to say, there is no special encoding that you have to handle; you won't actually see %3C in your input, you see <. If you're failing to encode that for on-page display then you don't have even the most basic defenses against XSS.
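
A small sketch of that round trip, with the decoding done by hand via urldecode() to stand in for what PHP does when it populates $_GET and $_POST (the variable names are invented):

```php
<?php
// What travels over the wire in the tester's request.
$wire = '%3Cscript%3Ealert%28123%29%3C%2Fscript%3E';

// What your application actually sees: PHP has already decoded it.
$name = urldecode($wire);

// What should reach the page: the same string, HTML-encoded at output time.
echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8'), "\n";
```

The % never needs special handling; the only step that matters for XSS is the encoding at the point of output.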
You wrote: "We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements."
I'm afraid you have not fixed your XSS problems at all. You may have merely obfuscated them.
Input filtering is a depressingly common but quite wrong strategy for blocking XSS.
It is not the input that's the problem. As your boss says, there is no reason you shouldn't be able to input O'Brien. Or even <script>, like I am just now in this comment box. You should not attempt to strip tags in the input or even HTML-encode them, because who knows at input-time that the data is going to end up in an HTML page? You don't want your database filled with nonsense like 'Fish&amp;Chips' which then ends up in an e-mail or other non-HTML context with weird HTML escapes in it.
HTML-encoding is an output-stage issue. Leave the incoming strings alone, keep them as raw strings in the database (of course, if you are hacking together queries in strings to put the data in the database instead of parameterised queries, you would need to SQL-escape the content at exactly that point). Then only when you are inserting the values in HTML, encode them:
Name: <?php echo htmlspecialchars($row['name']); ?>
If you have a load of dodgy code like echo "Name: $name"; then I'm afraid you have much rewriting to do to make it secure.
Hint: consider defining a function with a short name like h so you don't have to type htmlspecialchars so much. Don't use htmlentities which will usually-unnecessarily encode non-ASCII characters, which will also mess them up unless you supply a correct $charset argument.
(Or, if you are using Zend_View, $this->escape().)
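
The short-name helper could look something like this. The flags are my own choice rather than anything prescribed above: ENT_QUOTES so the result is also safe inside quoted attribute values, and an explicit UTF-8 charset, which assumes your pages are UTF-8.

```php
<?php
// Short alias for output-stage HTML encoding.
function h(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Invented row standing in for data read back from the database, stored raw.
$row = ['name' => "O'Brien <script>alert(123)</script>"];

echo 'Name: ' . h($row['name']) . "\n";
```

The database keeps the raw O'Brien; only the HTML output sees the encoded form, so an e-mail or CSV export built from the same row is unaffected.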
Input validation is useful on an application-specific level, for things like ensuring telephone number fields contain numbers and not letters. It is not something you can apply globally to avoid having to think about the issues that arise when you put a string inside the context of another string—whether that's inside HTML, SQL, JavaScript string literals or one of the many other contexts that require escaping.

If you correctly escape strings every time you write them to the HTML page, you won't have any issues.
%3C is a URL-encoded <; it is decoded by the server.

User input filtering - do I need to filter HTML?

Note: I take care of SQL injection and output escaping elsewhere - this question is about input filtering only, thanks.
I'm in the middle of refactoring my user input filtering functions. Before passing the GET/POST parameter to a type-specific filter with filter_var() I do the following:
check the parameter encoding with mb_detect_encoding()
convert to UTF-8 with iconv() (with //IGNORE) if it's not ASCII or UTF-8
clean white-spaces with a function found on GnuCitizen.org
pass the result thru strip_tags() - no tags allowed at all, Markdown only
Now the question: does it still make sense to pass the parameter to a filter like htmLawed or HTML Purifier, or can I think of the input as safe? It seems to me that these two differ mostly on the granularity of allowed HTML elements and attributes (which I'm not interested into, as I remove everything), but htmLawed docs have a section about 'dangerous characters' that suggests there might be a reason to use it. In this case, what would be a sane configuration for it?

There are many different approaches to XSS that are secure. The only way to know if your approach holds water is to test it through exploitation. I recommend using a Free XSS vulnerability Scanner*, or the open source wapiti.
To be honest, I'll never use strip_tags() because you don't always need HTML tags to execute JavaScript! I like htmlspecialchars($var,ENT_QUOTES); .
For instance this is vulnerable to xss:
print('<a href="'.strip_tags($_REQUEST['xss']).'">link</a>');
You don't need <> to execute JavaScript in this case because you can use onmouseover. Here is an example attack:
$_REQUEST['xss']='" onMouseOver="alert(/xss/)"';
The ENT_QUOTES will take care of the double quotes which will patch this XSS vulnerability.
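
Here is a side-by-side sketch of that attack, assuming the payload ends up inside a double-quoted href attribute (the variable names are invented):

```php
<?php
// The attack string from above; note it contains no < or > at all.
$payload = '" onMouseOver="alert(/xss/)"';

// strip_tags finds no tags to strip, so the quotes break out of the attribute.
$unsafe = '<a href="' . strip_tags($payload) . '">link</a>';

// With ENT_QUOTES the quotes are encoded and the value stays inside href.
$safe = '<a href="' . htmlspecialchars($payload, ENT_QUOTES, 'UTF-8') . '">link</a>';

echo $unsafe, "\n", $safe, "\n";
```

In the first line the browser sees a new onMouseOver attribute; in the second it sees one harmless (if odd) href value.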
*I am affiliated with this site/service.

I think what you're doing is safe; at least from my point of view, no HTML code should get through your filter.

What is the correct/safest way to escape input in a forum?

I am creating a forum software using php and mysql backend, and want to know what is the most secure way to escape user input for forum posts.
I know about htmlentities() and strip_tags() and htmlspecialchars() and mysql_real_escape_string(), and even javascript's escape() but I don't know which to use and where.
What would be the safest way to process these three different types of input (by process, I mean get, save in a database, and display):
A title of a post (which will also be the basis of the URL permalink).
The content of a forum post limited to basic text input.
The content of a forum post which allows html.
I would appreciate an answer that tells me how many of these escape functions I need to use in combination and why.
Thanks!

When generating HTML output (like you're doing to get data into the form's fields when someone is trying to edit a post, or if you need to re-display the form because the user forgot one field, for instance), you'd probably use htmlspecialchars() : it will escape <, >, ", ', and & -- depending on the options you give it.
strip_tags will remove tags if user has entered some -- and you generally don't want something the user typed to just disappear ;-)
At least, not for the "content" field :-)
Once you've got what the user did input in the form (ie, when the form has been submitted), you need to escape it before sending it to the DB.
That's where functions like mysqli_real_escape_string become useful : they escape data for SQL
You might also want to take a look at prepared statements, which might help you a bit ;-)
with mysqli - and with PDO
You should not use anything like addslashes : the escaping it does doesn't depend on the Database engine ; it is better/safer to use a function that fits the engine (MySQL, PostGreSQL, ...) you are working with : it'll know precisely what to escape, and how.
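
As a small illustration of the prepared-statement route with PDO: the table and column names below are invented, and the in-memory SQLite database is only there so the snippet runs on its own. With MySQL you would change the DSN and keep the rest.

```php
<?php
// In-memory SQLite, purely so the example is self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)');

// Raw user input, stored exactly as typed.
$title = "O'Brien's post";
$body  = "Robert'); DROP TABLE posts;--";

// Placeholders keep the data separate from the SQL text, so no manual escaping.
$insert = $pdo->prepare('INSERT INTO posts (title, body) VALUES (:title, :body)');
$insert->execute([':title' => $title, ':body' => $body]);

// Reading back works the same way.
$select = $pdo->prepare('SELECT title, body FROM posts WHERE title = :title');
$select->execute([':title' => $title]);
$stored = $select->fetch(PDO::FETCH_ASSOC);

echo $stored['body'], "\n";
```

The quotes in the input arrive in the table untouched, and the table is still there afterwards.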
Finally, to display the data inside a page :
for fields that must not contain HTML, you should use htmlspecialchars() : if the user did input HTML tags, those will be displayed as-is, and not injected as HTML.
for fields that can contain HTML... This is a bit trickier : you will probably only want to allow a few tags, and strip_tags (which can do that) is not really up to the task (it lets the attributes of the allowed tags through)
You might want to take a look at a tool called HTML Purifier : it will allow you to specify which tags and attributes should be allowed -- and it generates valid HTML, which is always nice ^^
This might take some time to compute, and you probably don't want to re-generate that HTML each time it has to be displayed ; so you can think about storing it in the database (either only keeping that clean HTML, or keeping both it and the not-clean one, in two separate fields -- might be useful to allow people editing their posts ? )
Those are only a few pointers... hope they help you :-)
Don't hesitate to ask if you have more precise questions !

mysql_real_escape_string() escapes everything you need to put in a mysql database. But you should use prepared statements (in mysqli) instead, because they're cleaner and do any escaping automatically.
Everything else can be handled with htmlspecialchars(), which neutralises HTML when you output the text, and urlencode(), which puts values into a format suitable for URLs.
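
For the URL side, a tiny sketch with an invented title. rawurlencode() is the variant usually used for path segments and urlencode() for query-string values; the difference shows up in how spaces are encoded.

```php
<?php
// Invented post title that needs to go into a URL.
$title = 'Fish & Chips: worth it?';

$inPath  = rawurlencode($title);  // spaces become %20
$inQuery = urlencode($title);     // spaces become +

echo $inPath, "\n", $inQuery, "\n";

// When the URL is then written into HTML, it is HTML-encoded like any other value.
echo '<a href="/search?q=' . htmlspecialchars($inQuery, ENT_QUOTES, 'UTF-8') . '">',
     htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), '</a>', "\n";
```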

There are two completely different types of attack you have to defend against:
SQL injection: input that tries to manipulate your DB. mysql_real_escape_string() and addslashes() are meant to defend against this. The former is better, but parameterized queries are better still
Cross-Site scripting (XSS): input that, when displayed on your page, tries to execute JavaScript in a visitor's browser to do all kinds of things (like steal the user's account data). htmlspecialchars() is the definite way to defend against this.
Allowing "some HTML" while avoiding XSS attacks is very, very hard. This is because there are endless possibilities of smuggling JavaScript into HTML. If you decided to do this, the safe way is to use BBCode or Markdown, i.e. a limited set of non-HTML markup that you then convert to HTML, while removing all real HTML with htmlspecialchars(). Even then you have to be careful not to allow javascript: URLs in links. Actually allowing users to input HTML is something you should only do if it's absolutely crucial for your site. And then you should spend a lot of time making sure you understand HTML and JavaScript and CSS completely.
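
On the javascript: URL point, one way to approach it is an allow-list of schemes rather than a block-list. The function name isSafeLinkUrl is made up, and limiting links to absolute http/https URLs is an assumption of this sketch (relative URLs are rejected too).

```php
<?php
// Hypothetical check for URLs that users put into [url] or Markdown links.
function isSafeLinkUrl(string $url): bool
{
    // parse_url returns null when there is no scheme and false on malformed input.
    $scheme = parse_url(trim($url), PHP_URL_SCHEME);

    // Accept only schemes we explicitly trust; everything else is refused.
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(isSafeLinkUrl('https://example.com/thread/1'));
var_dump(isSafeLinkUrl('JaVaScRiPt:alert(1)'));
```

A URL that passes still has to be HTML-encoded with htmlspecialchars() when it is written into the href attribute.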

The answer to this post is a good answer
Basically, using the pdo interface to parameterize your queries is much safer and less error prone than escaping your inputs manually.

I have a tendency to escape all characters that would be problematic in page display, Javascript and SQL all at the same time. It leaves it readable on the web and in HTML eMail and at the same time removes any problems with the code.
A VB.NET line of code would be:
SafeComment = Replace( _
    Replace(Replace(Replace( _
    Replace(Replace(Replace( _
    Replace(Replace(Replace( _
    Replace(Replace(Replace( _
        HttpUtility.HtmlEncode(Trim(strInput)), _
        ":", "&#58;"), "-", "&#45;"), "|", "&#124;"), _
        "`", "&#96;"), "(", "&#40;"), ")", "&#41;"), _
        "%", "&#37;"), "^", "&#94;"), """", "&quot;"), _
        "/", "&#47;"), "*", "&#42;"), "\", "&#92;"), _
        "'", "&#39;")

First of all, general advice: don't escape variables literally when inserting in the database. There are plenty of solutions that let you use prepared statements with variable binding. The reason to not do this explicitly is because it is only a matter of time then before you forget it just once.
If you're inserting plain text in the database, don't try to clean it on insert, but instead clean it on display. That is to say, use htmlentities to encode it as HTML (and pass the correct charset argument). You want to encode on display because then you're no longer trusting that the database contents are correct, which isn't necessarily a given.
If you're dealing with rich text (html), things get more complicated. Removing the "evil" bits from HTML without destroying the message is a difficult problem. Realistically speaking, you'll have to resort to a standardized solution, like HTMLPurifier. However, this is generally too slow to run on every page view, so you'll be forced to do this when writing to the database. You'll also have to ensure that the user can see their "cleaned up" html and correct the cleaned up version.
Definitely try to avoid "rolling your own" filter or encoding solution at any step. These problems are notoriously tricky, and you run a large risk of overlooking some minor detail that has big security implications.

I second Joeri: do not roll your own. Go here to see some of the many possible XSS attacks:
http://ha.ckers.org/xss.html
htmlentities() -> turns text into html, converting characters to entities. If using UTF-8 encoding then use htmlspecialchars() instead as the other entities are not needed. This is the best defence against XSS. I use it on every variable I output regardless of type or origin unless I intend it to be html. There is only a tiny performance cost and it is easier than trying to work out what needs escaping and what doesn't.
strip_tags() - turns html into text by removing all html tags. Use this to ensure that there is nothing nasty in your input as a adjunct to escaping your output.
mysql_real_escape_string() - escapes a string for mysql and is your defence against SQL injections from little Bobby tables (better to use mysqli and prepare/bind as escaping is then done for you and you can avoid lots of messy string concatenations)
The advice given above about avoiding HTML input unless it is essential, and opting for BBCode or similar (make your own up if needs be), is very sound indeed.
