Users entering ampersand & character messing up my sites w3c validation

Users entering ampersand & character messing up my sites w3c validation - php

my social networking site is w3c xhtml valid however users are able to post blog reports and stuff and at times enter in ampersand characters which in turn mess up my validation. How can I fix this and are there any other single characters that I need to look out for that could mess up my validation?

When displaying user produced content, run it through the htmlspecialchars() function.

As a matter of general principle it's a mistake to include user-submitted (or indeed any external) content into your page directly without validation or filtering. Besides causing validation errors it can also cause "broken pages" and large security holes (cross-site scripting attacks).
Whenever you get data from anywhere that isn't 100% trusted, you need to make it safe in some way. You can do this by doing some or all of:
Escaping textual data so that special characters are replaced by the HTML entities that represent them.
Stripping or filtering unsafe HTML tags.
Validating that HTML doesn't contain any unsafe or illegal constructs.
If your user input is meant to be interpreted as text then you're mostly looking at option 1; if you're letting the users use HTML then you're looking at options 2 and 3. A fourth option is to have the users use some more restrictive non-HTML markup such as Markdown or bbCode, translating between that markup and HTML using a library that (hopefully) doesn't allow the injection of security holes, page-breaking constructs, or other scary things.

It's a bad idea to allow users to enter HTML markup.
This enables all kinds of nasty things, most notably cross-site scripting (XSS) exploits and injection of hidden spam (hidden from you, not search engine bots).
You should:
Obliterate all HTML tags using htmlspecialchars() and only preserve newlines with nl2br(). You might allow some formatting by implementing your own safe markup that allows only very specific tags (things like phpBB or Wiki-like markup).
Use HTML Purifier to reliably eliminate all potentially-dangerous markup. PHP's strip_tags() function is fundamentally broken and allows dangerous code in attributes if you use whitelist argument.

Related

Protection against XSS exploits?

I'm newish to PHP but I hear XSS exploits are bad. I know what they are, but how do I protect my sites?

To prevent from XSS attacks, you just have to check and validate properly all user inputted data that you plan on using and dont allow html or javascript code to be inserted from that form.
Or you can you Use htmlspecialchars() to convert HTML characters into HTML entities. So characters like <> that mark the beginning/end of a tag are turned into html entities and you can use strip_tags() to only allow some tags as the function does not strip out harmful attributes like the onclick or onload.

Escape all user data (data in the database from user) with htmlentities() function.
For HTML data (for example from WYSIWYG editors), use HTML Purifier to clean the data before saving it to the database.

strip_tags() if you want to have no tags at all. Meaning anything like <somthinghere>
htmlspecialchars() would covert them to html so the browser will only show and not try to run.
If you want to allow good html i would use something like htmLawed or htmlpurifier

The bad news
Unfortunately, preventing XSS in PHP is a non-trivial undertaking.
Unlike SQL injection, which you can mitigate with prepared statements and carefully selected white-lists, there is no provably secure way to separate the information you are trying to pass to your HTML document from the rest of the document structure.
The good news
However, you can mitigate known attack vectors by being particularly cautious with your escaping (and keeping your software up-to-date).
The most important rule to keep in mind: Always escape on output, never on input. You can safely cache your escaped output if you're concerned about performance, but always store and operate on the unescaped data.
XSS Mitigation Strategies
In order of preference:
If you are using a templating engine (e.g. Twig, Smarty, Blade), check that it offers context-sensitive escaping. I know from experience that Twig does. {{ var|e('html_attr') }}
If you want to allow HTML, use HTML Purifier. Even if you think you only accept Markdown or ReStructuredText, you still want to purify the HTML these markup languages output.
Otherwise, use htmlentities($var, ENT_QUOTES | ENT_HTML5, $charset) and make sure the rest of your document uses the same character set as $charset. In most cases, 'UTF-8' is the desired character set.
Why shouldn't I filter on input?
Attempting to filter XSS on input is premature optimization, which can lead to unexpected vulnerabilities in other places.
For example, a recent WordPress XSS vulnerability employed MySQL column truncation to break their escaping strategy and allow the prematurely escaped payload to be stored unsafely. Don't repeat their mistake.

How to prevent XSS attack with Zend Form using %

our company has made a website for our client. The client hired a webs security company to test the pages for security before the product launches.
We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements.
They ran another test and it still failed :(
They used the following for the one input field in the data of the http header: name=%3Cscript%3Ealert%28123%29%3C%2Fscript%3E which basically translates to name=<script>alert(123);</script>
I've added alpha and alnum to some of the fields, which fixes the XSS vulnerability (touch wood) by removing the %, however, now the boss don't like it because what of O'Brien and double-barrel surnames...
I haven't come across the %3C as < problem reading up about XSS. Is there something wrong with my html character set or encoding or something?
I probably now have to write a custom filter, but that would be a huge pain to do that with every website and deployment. Please help, this is really frustrating.
EDIT:
if it's about escaping the form's output, how do I do that? The form submits to the same page - how do I escape if I only have in my view <?= $this->form ?>
How can I get Zend Form to escape it's output?

%3Cscript%3Ealert%28123%29%3C%2Fscript%3E is the URL-encoded form of <script>alert(123);</script>. Any time you include < in a form value, it will be submitted to the server as %3C. PHP will read and decode that back to < before anything in your application gets a look at it.
That is to say, there is no special encoding that you have to handle; you won't actually see %3C in your input, you see <. If you're failing to encode that for on-page display then you don't have even the most basic defenses against XSS.
We've removed most of our XSS problems. We developed the website with zend. We add the StripTags, StringTrim and HtmlEntities filters to the order form elements.
I'm afraid you have not fixed your XSS problems at all. You may have merely obfuscated them.
Input filtering is a depressingly common but quite wrong strategy for blocking XSS.
It is not the input that's the problem. As your boss says, there is no reason you shouldn't be able to input O'Brien. Or even <script>, like I am just now in this comment box. You should not attempt to strip tags in the input or even HTML-encode them, because who knows at input-time that the data is going to end up in an HTML page? You don't want your database filled with nonsense like 'Fish&Chips' which then ends up in an e-mail or other non-HTML context with weird HTML escapes in it.
HTML-encoding is an output-stage issue. Leave the incoming strings alone, keep them as raw strings in the database (of course, if you are hacking together queries in strings to put the data in the database instead of parameterised queries, you would need to SQL-escape the content at exactly that point). Then only when you are inserting the values in HTML, encode them:
Name: <?php echo htmlspecialchars($row['name']); ?>
If you have a load of dodgy code like echo "Name: $name"; then I'm afraid you have much rewriting to do to make it secure.
Hint: consider defining a function with a short name like h so you don't have to type htmlspecialchars so much. Don't use htmlentities which will usually-unnecessarily encode non-ASCII characters, which will also mess them up unless you supply a correct $charset argument.
(Or, if you are using Zend_View, $this->escape().)
Input validation is useful on an application-specific level, for things like ensuring telephone number fields contain numbers and not letters. It is not something you can apply globally to avoid having to think about the issues that arise when you put a string inside the context of another string—whether that's inside HTML, SQL, JavaScript string literals or one of the many other contexts that require escaping.

If you correctly escape strings every time you write them to the HTML page, you won't have any issues.
%3C is a URL-encoded <; it is decoded by the server.

What characters to strip from messages?

I'm quite surprised I haven't been able to find out what characters I need to strip from a message in order to keep my application safe.
I've got a php app, and most of the inputs are numerical, but I'm adding the ability for users to attache messages, so I need to cleanse the message and strip any characters that could be a threat.
My initial reaction was if I did
$message=addslashes(preg_replace('/[^a-zA-Z0-9\-,& $%\(\)##!\'\"?.]/','',$_POST['message']));
I'd be safe, but I haven't been able to find anything which states what characters can be damaging, and what characters would be safe.

I would say that you don't have to strip any characters from your input, at least generally speaking.
Instead, you must escape your data :
when sending it to your database
see mysql_real_escape_string, mysqli_real_escape_string, PDO::quote
or Prepared statements : MySQLi ; PDO
when sending it to the HTML output
see htmlspecialchars
Still, if you allow users to input HTML, you should take a look at HTMLPurifier, to make sure they are not able to inject any malicious HTML code into your web-pages :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove all
malicious code (better known as
XSS) with a thoroughly audited, secure yet permissive whitelist, it
will also make sure your documents are
standards compliant

This is where HTML Purifier comes in handy.

Instead of sanitizing your data just use Prepared Statements for database interaction. PDOs eliminate the need of hand santizing all of your input yourself.
PHP Manual

HTML Purifier - what to purify?

I am using HTML Purifier to protect my application from XSS attacks. Currently I am purifying content from WYSIWYG editors because that is the only place where users are allowed to use XHTML markup.
My question is, should I use HTML Purifier also on username and password in a login authentication system (or on input fields of sign up page such as email, name, address etc)? Is there a chance of XSS attack there?

You should Purify anything that will ever possibly be displayed on a page. Because with XSS attacks, hackers put in <script> tags or other malicious tags that can link to other sites.
Passwords and emails should be fine. Passwords should never be shown and emails should have their own validator to make sure that they are in the proper format.
Finally, always remember to put in htmlentities() on content.
Oh .. and look at filter_var aswell. Very nice way of filtering variables.

XSS risks exist where ever data entered by one user may be viewed by other users. Even if this data isn't currently viewable, don't assume that a need to do this won't arise.
As far as the username and password go, you should never display a password, or even store it in a form that can be displayed (i.e. encyrpt it with sha1()). For usernames, have a restriction on legal characters like [A-Za-z0-9_]. Finally, as the other answer suggests, use your languages html entity encoding function for any entered data that may contain reserved or special html characters, which prevents this data from causing syntax errors when displayed.

No, I wouldn't use HTMLPurifier on username and password during login authentication. In my appllications I use alphanumeric usernames and an input validation filter and display them with htmlspecialchars with ENT_QUOTES. This is very effective and a hell lot faster than HTMLpurifier. I'm yet to see an XSS attack using alphanumeric string. And BTW HTMLPurifier is useless when filtering alphanumeric content anyway so if you force the input string through an alphanumeric filter then there is no point to display it with HTMLpurifier. When it comes to passwords they should never be displayed to anybody in the first place which eliminates the possibility of XSS. And if for some perverse reason you want to display the passwords then you should design your application in such a way that it allows only the owner of the password to be able to see it, otherwise you are screwed big time and XSS is the least of your worry!

HTML Purifier takes HTML as input, and produces HTML as output. Its purpose is to allow the user to enter html with some tags, attributes, and values, while filtering out others. This uses a whitelist to prevent any data that can contain scripts. So this is useful for something like a WYSIWYG editor.
Usernames and passwords on the other hand are not HTML. They're plain text, so HTML purifier is not an option. Trying to use HTML Purifier here would either corrupt the data, or allow XSS attacks.
For example, it lets the following through unchanged, which can cause XSS issues when inserted as an attribute value in some elements:
" onclick="javascript:alert()" href="
Or if someone tried to use special symbols in their password, and entered:
<password
then their password would become blank, and make it much easier to guess.
Instead, you should encode the text. The encoding required depends on the context, but you can use htmlentities when outputting these values if you stick to rule #0 and rule #1, at the OWASP XSS Prevention Cheat Sheet

Comprehensive server-side validation

I currently have a fairly robust server-side validation system in place, but I'm looking for some feedback to make sure I've covered all angles. Here is a brief outline of what I'm doing at the moment:
Ensure the input is not empty, or is too long
Escape query strings to prevent SQL injection
Using regular expressions to reject invalid characters (this depends on what's being submitted)
Encoding certain html tags, like <script> (all tags are encoded when stored in a database, with some being decoded when queried to render in the page)
Is there anything I'm missing? Code samples or regular expressions welcome.

You shouldn't need to "Escape" query strings to prevent SQL injection - you should be using prepared statements instead.
Ideally your input filtering will happen before any other processing, so you know it will always be used. Because otherwise you only need to miss one spot to be vulnerable to a problem.
Don't forget to encode HTML entities on output - to prevent XSS attacks.

You should encode every html tag, not only 'invalid' ones. This is a hot debate, but basically it boils down to there will always be some invalid HTML combination that you will forget to handle correctly (nested tags, mismatched tags some browsers interpret 'correctly' and so on). So the safest option in my opinion is to store everything as htmlentities and then, on output, print a validated HTML-safe-subset tree (as entities) from the content.

Run all server-side validation in a library dedicated to the task so that improvements in one area affect all of your application.
Additionally include work against known attacks, such as directory traversal and attempts to access the shell.

This Question/Answer has some good responses that you're looking for
(PHP-oriented, but then again you didn't specify language/platform and some of it applies beyond the php world):
What's the best method for sanitizing user input with PHP?

You might check out the Filter Extension for data filtering. It won't guarantee that you're completely airtight, but personally I feel a lot better using it because that code has a whole lot of eyeballs looking over it.
Also, consider prepared statements seconded. Escaping data in your SQL queries is a thing of the past.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.