What is the best way to sanitize user inputs?

I need to prevent XSS attacks as much as possible and in a centralized way so that I don't have to explicitly sanitize each input.
My question is: is it better to sanitize all inputs at the URL/request processing level, to encode/sanitize inputs before serving, or to do it at the presentation level (output sanitization)?
Which one is better and why?

There are two areas where you need to be careful:
Anywhere where you use input as part of a script in any language, most notably including SQL. In the particular case of SQL, the only recommended way of dealing with things is the use of parameterized queries (which will result in unescaped content being in the database, but just as strings: that's ideal). Anything involving the magic quoting of characters before substituting them directly into the SQL string is inferior (because it's so easy to get wrong). Anything that can't be done with a parameterized query is something that a service secured against SQL-injection should never allow a user to specify.
Anywhere where you present something that was input as output. The source of the input could be direct (including via a cookie) or indirect (via the database or a file). In this case, your default approach should be to make the text that the user sees be the text that was input. That's very easy to implement correctly since, in element content, the only characters you actually have to quote are < and & (attribute values also need their quote characters escaped), and you can wrap it all in <pre> for display.
But that's often not enough. For example, you might want to allow users to do some sort of formatting. This is where it is ever so easy to go wrong. The simplest approach in this case is to parse the input and detect all the formatting instructions; everything else needs to be quoted properly. You should store the formatted version additionally in the database as an extra column so that you don't need to do much work when returning it to the user, but you should also store the original version that the user input so you can search over it. Do not mix them up! Really! Audit your application to make totally sure that you get this right (or, better yet, get someone else to do the audit).
But everything about being careful with SQL still applies, and there are many HTML tags (e.g., <script>, <object>) and attributes (e.g., onclick) that are never ever safe.
You were looking for advice about specific packages to do the work? You really need to pick a language then. The above advice is all totally language-independent. Add-on packages/libraries can make many of the steps above really easy in practice, but you still absolutely need to be careful.
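Since the question is tagged php, here is a minimal sketch of the first area (input used as part of SQL) using PDO. The comments table, its columns and the function names are assumptions made up for illustration; the point is only that user text reaches the database as a bound parameter and is stored unescaped.

```php
<?php
// Sketch only: the "comments" table and all function names are hypothetical.

/** Store the text exactly as typed; it is sent as data, never spliced into the SQL. */
function saveComment(PDO $db, string $author, string $body): void
{
    $stmt = $db->prepare('INSERT INTO comments (author, body) VALUES (:author, :body)');
    $stmt->execute([':author' => $author, ':body' => $body]);
}

/** Look up comments for one author, again passing the value as a parameter. */
function findCommentsByAuthor(PDO $db, string $author): array
{
    $stmt = $db->prepare('SELECT body FROM comments WHERE author = :author');
    $stmt->execute([':author' => $author]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

/** Second area: quote the stored text when it goes back out as HTML. */
function escapeForHtml(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}
```

With a PDO connection in exception mode, calling saveComment($db, "O'Brien", "'); DROP TABLE comments; --") should simply store that odd-looking string, and escapeForHtml() is applied only when it is printed into a page.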

Related

Sanitizing data before storing it might mean stored data is different to what the user entered - is that common practice?

The Background
HTML form, e.g. for a user to submit their business details which will later appear on a legal document - so data needs to be precise.
Submits to a PHP script that validates all inputs.
If all inputs are valid, it sanitizes the data and writes it to a database using parameterised queries.
If any of the inputs are invalid, it re-displays the form. My feeling is that the user would expect this form to be populated with what they originally typed in with some feedback on what is wrong with their input. They can then amend their input and re-submit the form. This means the form needs to be populated with unsanitized data (this will be escaped before displaying it).
All good so far.
The Problem
If the data is valid, it is written to a database. Best practice seems to be to sanitize the data before sending it to the database.
This means the data written to the database might not be exactly what the user typed in (e.g. if sanitization removes some "dangerous" characters).
This seems like a poor user experience to me.
I'm using PHP and the code is running within the WordPress framework. WP has its own sanitization functions and they recommend always sanitizing input before using it. They also suggest using PHP's sanitization features too. But nothing seems to address the issue that sanitizing data before storing it might result in saved data being different from what the user entered.
The Question
What I'd like is a description of an approach that's been used in the real world that addresses this issue. Or some feedback from those of you more experienced than I am that this is not a problem in the real world, and that it's common practice just to sanitize data and store it without further concern or feedback to the user.
My thoughts about possible solutions
A more thorough pattern would be to consider unsanitary data as invalid and feed back to the user what is wrong with their input. But this seems impractical and would require fairly long sanitization functions to provide any specific and useful feedback to the user. It also renders existing WP/PHP sanitization functions somewhat irrelevant.
A practical compromise may be to compare sanitized data with raw data and then simply notify the user that something got cleaned up before it was saved... so they can at least check the saved data to make sure they're happy with it.
Thanks for your help.
Conclusions
The answer I've accepted was helpful and led me to a solution to my particular use case, but I wanted to add a few points of my own.
Firstly, on re-reading the WP documentation I found that it's not recommending to validate AND then sanitize before writing to a database. It recommends to validate, but suggests sanitizing the input might be more convenient if the particular situation does not require strict validation. It also says use one or the other, not both. So I don't think the WP documentation is wrong on this, I just misread it.
Secondly, I didn't understand that parameterized queries are so effective against SQL injection. So I figured that sanitizing input before using it in a DB query was a sensible thing to do. But it seems it's not necessary.
And finally, I now realise that it's all about context... the issue is making data safe for a particular use. In that sense, it's not that one technique is only appropriate for input and another technique is only appropriate for output. I need to think about validating, sanitizing or escaping when doing anything with the data - e.g. write it to a DB, use it in a calculation, print it to screen, or inject it into a PDF document. And in all cases, I just need to think about how I make it safe for that particular use. Sanitizing "input" might be entirely appropriate - if it's quick and easy, makes the data safe for whatever I need to do and doesn't render the data inaccurate. Another example is the WordPress function esc_url_raw() which the manual says is specifically to be used when storing a URL in the database. So again, the idea that escaping is only appropriate for "output" is misleading.
I ended up validating the input before writing it to the database. I did not need to sanitize it as well. So if it's invalid, I tell the user. If it's valid, it gets written to the DB in its original form. And I escape it before displaying it back to the user.

"Best practice seems to be to sanitize the data before sending it to the database."
This is a common misconception. Sanitization should only be performed on data that is being output (to prevent XSS, for example), and even then only as a last resort, precisely because it can irreversibly destroy the original data.
Validation is your first line of defense. Make sure that the data is properly formatted, and valid within its context - just that; no looking for special characters, don't be over-zealous. If it's not valid - reject it, don't try to salvage the "good" parts from it.
Then, when storing in the database, you merely need to use parameterized queries. That reliably prevents SQL injection, provided user data only ever reaches the query as a parameter. If you didn't mangle the data in a previous step, you're storing it in its original form.
And finally, when the data is being output, that is where you SHOULD escape special characters within the appropriate context, so that it is properly rendered; or sanitize it if you have no other choice (i.e. the context is unclear and therefore you can't do proper escaping).
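A rough sketch of the validate-then-reject step and of re-displaying the form with the user's original text might look like this. The field names, the limits and the error messages are assumptions for illustration only, and nothing here uses WordPress functions.

```php
<?php
// Sketch only: field names, rules and messages are hypothetical.

/** Return error messages keyed by field; an empty array means the input is valid. */
function validateBusinessDetails(array $input): array
{
    $errors = [];

    $name = $input['company_name'] ?? '';
    if (!is_string($name) || trim($name) === '' || strlen($name) > 200) {
        $errors['company_name'] = 'Company name is required (200 bytes at most).';
    }

    // Check the format only; no hunting for "dangerous" characters.
    if (filter_var($input['email'] ?? '', FILTER_VALIDATE_EMAIL) === false) {
        $errors['email'] = 'Please enter a valid e-mail address.';
    }

    return $errors;
}

/** Re-render a field with the raw value, escaped for the HTML attribute context. */
function renderTextField(string $field, string $rawValue, ?string $error): string
{
    $html = sprintf(
        '<input type="text" name="%s" value="%s">',
        htmlspecialchars($field, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($rawValue, ENT_QUOTES, 'UTF-8')
    );
    if ($error !== null) {
        $html .= '<span class="error">' . htmlspecialchars($error, ENT_QUOTES, 'UTF-8') . '</span>';
    }
    return $html;
}
```

If validateBusinessDetails() returns an empty array, the untouched input goes to the database through a parameterized query; otherwise the form is rendered again with renderTextField() so the user sees exactly what they typed.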

It looks like you are worried about how the user will feel, and that's good. There are a few things you can do.
Use the HTML form pattern attribute. Surely no name needs characters like < > & $ ", so exclude them with a pattern, and use the CSS :invalid and :invalid:focus selectors to tell the user before submitting that something is wrong. It is very easy and simple.
Then comes further PHP validation and WP sanitization.
You can use an intermediate state: after the 'wash', display the final version (no inputs) with two buttons, save or correct, and let the user decide. Most of us don't like repeated "are you sure? By clicking submit, do you mean submit?" prompts, but with content this important users may want a last chance to see the final version (without inputs, checkboxes, etc.).
Now you put the accepted version into the DB (with a prepared statement).
Comparing raw data with washed data is not practical; honestly, it sucks. The users won't be coders, and they just won't be able to make sense of "we sanitised your answers, and now they are 345 characters shorter. Sorry for the inconvenience".
Don't worry too much
...there is a German last name 'Ei', only 2 characters long, so the pattern can't require more than 2.

Sanitizing PHP Variables, am I overusing it?

I've been working with PHP for some time and I began asking myself if I'm developing good habits.
One of these is what I believe amounts to overusing PHP sanitizing methods. For example, a user registers through a form, and I get the following post variables:
$_POST['name'], $_POST['email'] and $_POST['captcha']. Now, what I usually do is obviously sanitize the data I am going to place into MySQL, but when comparing the captcha, I also sanitize it.
Therefore I believe I misunderstood PHP sanitizing. I'm curious: are there any other cases when you need to sanitize data except when using it to place something in MySQL (note I know sanitizing is also needed to prevent XSS attacks)? And moreover, is my habit of sanitizing almost every variable coming from user input a bad one?

Whenever you store your data someplace, and if that data will be read/available to (unsuspecting) users, then you have to sanitize it. So something that could possibly change the user experience (not necessarily only the database) should be taken care of. Generally, all user input is considered unsafe, but you'll see in the next paragraph that some things might still be ignored, although I don't recommend it whatsoever.
Stuff that happens on the client only is sanitized just for a better UX (user experience, think about JS validation of the form - from the security standpoint it's useless because it's easily avoidable, but it helps non-malicious users to have a better interaction with the website) but basically, it can't do any harm because that data (good or bad) is lost as soon as the session is closed. You can always destroy a webpage for yourself (on your machine), but the problem is when someone can do it for others.
To answer your question more directly - never worry about overdoing it. It's always better to be safe than sorry, and the cost is usually not more than a couple of milliseconds.

The term you need to search for is FIEO. Filter Input, Escape Output.
You can easily confound yourself if you do not understand this basic principle.
Imagine PHP is the man in the middle, it receives with the left hand and doles out with the right.
A user fills in a date field on your form, so it should only accept digits and maybe dashes, e.g. nnnn-nn-nn. If you get something which does not match that, then reject it.
That is an example of filtering.
Next, PHP does something with it, let's say storing it in a MySQL database.
What MySQL needs is to be protected from SQL injection, so you use PDO's or MySQLi's prepared statements to make sure that EVEN IF your filter failed you cannot permit an attack on your database. This is an example of Escaping, in this case escaping for SQL storage.
Later, PHP gets the data from your db and displays it on an HTML page. So you need to Escape the data for the next medium, HTML (this is where XSS attacks get through if you don't).
In your head you have to divide each of the PHP 'protective' functions into one or other of these two families, Filtering or Escaping.
Freetext fields are of course more complex than filtering for a date, but never mind, stick to the principles and you will be OK.
Hoping this helps http://phpsec.org/projects/guide/
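As a small illustration of the filtering half, a date filter could be sketched like this. The function name is made up, and the accepted shape (four digits, two digits, two digits, separated by dashes) is an assumption about what the form expects.

```php
<?php
// Sketch only: filterDate() is a hypothetical helper.

/** Accept only a real calendar date in year-month-day form; reject anything else. */
function filterDate(string $input): ?string
{
    // \A and \z anchor the whole string, so trailing junk cannot sneak in.
    if (!preg_match('/\A(\d{4})-(\d{2})-(\d{2})\z/', $input, $m)) {
        return null; // wrong shape
    }
    // checkdate() takes month, day, year.
    if (!checkdate((int) $m[2], (int) $m[3], (int) $m[1])) {
        return null; // right shape, but not a real date
    }
    return $input; // returned unchanged, never "repaired"
}
```

A null result means the request is rejected; the value is still passed to the database through a prepared statement afterwards, so the filter is not the only line of defence.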

PHP user input data security

I am trying to figure out which functions are best to use in different cases when inputting data, as well as outputting data.
When I allow a user to input data into MySQL, what is the best way to secure the data to prevent SQL injections and/or any other type of injection or hack someone could attempt?
When I output the data as regular html from the database what is the best way to do this so scripts and such cannot be run?
At the moment I basically only use
mysql_real_escape_string();
before inputting the data to the database, this seems to work fine, but I would like to know if this is all I need to do, or if some other method is better.
And at the moment I use
stripslashes(nl2br(htmlentities()))
(most of the time anyways) for outputting data. I find these work fine for what I usually use them for, however I have run into a problem with htmlentities, I want to be able to have some html tags output respectively, for example:
<ul></ul><li></li><bold></bold>
etc, but I can't.
Any help would be great, thanks.

I agree with mikikg that you need to understand SQL injection and XSS vulnerabilities before you can try to secure applications against these types of problems.
However, I disagree with his assertions to use regular expressions to validate user input as a SQL injection preventer. Yes, do validate user input insofar as you can. But don't rely on this to prevent injections, because hackers break these kinds of filters quite often. Also, don't be too strict with your filters -- plenty of websites won't let me log in because there's an apostrophe in my name, and let me tell you, it's a pain in the a** when this happens.
There are two kinds of security problems you mention in your question. The first is a SQL injection. This vulnerability is a "solved problem." That is, if you use parameterized queries, and never pass user supplied data in as anything but a parameter, the database is going to do the "right thing" for you, no matter what happens. For many databases, if you use parameterized queries, there's no chance of injection because the data isn't actually sent embedded in the SQL -- the data is passed unescaped in a length prefixed or similar blob along the wire. This is considerably more performant than database escape functions, and can be safer. (Note: if you use stored procedures that generate dynamic SQL on the database, they might also have injection problems!)
The second problem you mention is the cross site scripting problem. If you want to allow the user to supply HTML without entity escaping it first, this problem is an open research question. Suffice to say that if you allow the user to pass some kinds of HTML, it's entirely likely that your system will suffer an XSS problem at some point to a determined attacker. Now, the state of the art for this problem is to "filter" the data on the server, using libraries like HTMLPurifier. Attackers can and do break these filters on a regular basis; but as of yet nobody has found a better way of protecting the application from these kinds of things. You may be better off only allowing a specific whitelist of HTML tags, and entity encoding anything else.

This is one of the most problematic tasks today :)
You need to know how SQL injection and other attack methods work. There are very detailed explanations of each method at https://www.owasp.org/index.php/Main_Page, as well as a whole security framework for PHP.
Using the security libraries that come with a framework such as CodeIgniter or Zend is also a good choice.
Next, use regular expressions as much as you can and tie each input to pattern rules for its specific format.
Use prepared statements or the active record class of your framework.
Always cast your input with (int)$_GET['myvar'] if you really need numeric values.
There are so many other rules and methods to secure your application, but one golden rule is "never trust user's input".
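One caveat about the cast is that it silently turns malformed input into a number instead of rejecting it. The sketch below contrasts the two behaviours; readPositiveInt() and the id parameter are hypothetical names used only for illustration.

```php
<?php
// Sketch only: the helper name and the "id" key are hypothetical.

/** Return the value as a positive integer, or null when it is not one. */
function readPositiveInt(array $query, string $key): ?int
{
    $value = filter_var(
        $query[$key] ?? null,
        FILTER_VALIDATE_INT,
        ['options' => ['min_range' => 1]]
    );
    return $value === false ? null : $value;
}

// A plain cast never fails; it keeps the leading digits or falls back to 0.
var_dump((int) '12abc'); // int(12)
var_dump((int) 'abc');   // int(0)

// Validation rejects the same input outright.
var_dump(readPositiveInt(['id' => '12abc'], 'id')); // NULL
var_dump(readPositiveInt(['id' => '42'], 'id'));    // int(42)
```

Either way the result is safe to use as a number; the difference is whether a bad request is quietly accepted or turned away.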

In your php configuration, magic_quotes_gpc should be off. So you won't need stripslashes.
For SQL, take a look at PDO's prepared statements.
And for your custom tags, as there are only three of them, you can do a preg_replace call after the call to htmlentities to convert those back before you insert them into the database.
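A minimal sketch of that idea follows. It uses htmlspecialchars rather than htmlentities, treats the question's bold tag as b, and only re-enables tags written without attributes; the function name and the tag list are assumptions.

```php
<?php
// Sketch only: the function name and the list of allowed tags are hypothetical.

/** Escape everything, then turn a few attribute-free tags back into real markup. */
function escapeAllowingBasicTags(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Matches only the exact escaped forms of <ul>, </ul>, <li>, </li>, <b>, </b>.
    // A tag carrying attributes (e.g. onclick) does not match and stays escaped.
    return preg_replace('#&lt;(/?)(ul|li|b)&gt;#i', '<$1$2>', $escaped);
}

echo escapeAllowingBasicTags('<b>bold</b> <script>alert(1)</script>'), "\n";
// prints: <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

This does not check that the tags are balanced, so a stray closing tag can still upset the page layout; a library such as HTML Purifier is the more thorough route when richer markup is needed.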

Doubt regarding storing values in a MySQL database

Is it better to save the exact user input in the database, or to clean it for XSS and then store it?
Or is it better to store the direct user input, and clean it when displaying it?
Please guide me.
Thanks

I would say it is better to store the exact data in the database and correctly escape it when you need to display it. This will make things much easier if you later want to display it using a different medium where the dangerous characters and escaping might be different.
There are also a few other problems with relying on custom "cleaning" functions instead of using the escaping functions provided by the standard library for your language.
Unnecessary Restrictions - If, for example, you always remove <script> tags people won't be able to talk about <script> tags on your site, like I did just now. That might be fine for some sites, but not others.
Subtle bugs - If you write your own "cleaning" function you might miss some dangerous input that you hadn't considered. An example is replacing <script> with an empty string, but forgetting that the user could enter <scri<script>pt> which after the replacement will become <script>. Using the built-in escaping functions generally will work correctly as they have (hopefully) been written by experienced programmers, tested well and used in thousands of other systems where security is important.
Special Cases - If you decide to clean all your input by for example removing '<' and '>' in all strings before storing them you will probably find out sooner or later that at least one specific field can't be cleaned because those characters are absolutely necessary in that one field, so you have to escape it instead. Now you have created a situation where you have to remember whether or not you should apply escaping to your data. This increases the chance of getting it wrong, and makes it difficult to see at a glance in your code whether you've forgotten to escape or whether it's one of the fields where escaping is not necessary.
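The subtle bug described above is easy to reproduce. In this sketch naiveClean() is a made-up example of a home-made cleaning function, shown next to the built-in htmlspecialchars() for comparison.

```php
<?php
// Sketch only: naiveClean() is a hypothetical home-made "cleaning" function.

function naiveClean(string $text): string
{
    // Single pass: the result of the replacement is not checked again.
    return str_replace('<script>', '', $text);
}

$input = '<scri<script>pt>alert(1)</script>';

echo naiveClean($input), "\n";
// prints: <script>alert(1)</script>

echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8'), "\n";
// prints: &lt;scri&lt;script&gt;pt&gt;alert(1)&lt;/script&gt;
```

The escaped version shows up in the browser exactly as the user typed it, and nothing in it is treated as markup.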

I'm learning PHP on my own and I've become aware of the strip_tags() function. Is this the only way to increase security?

I'm new to PHP and I'm following a tutorial here:
Link
It's pretty scary that a user can write php code in an input and basically screw your site, right?
Well, now I'm a bit paranoid and I'd rather learn security best practices right off the bat than try to cram them in once I have some habits in me.
Since I'm brand new to PHP (literally picked it up two days ago), I can learn pretty much anything easily without getting confused.
What other way can I prevent shenanigans on my site? :D

There are several things to keep in mind when developing a PHP application; strip_tags() only helps with one of those. Actually strip_tags(), while effective, might even do more than needed: converting possibly dangerous characters with htmlspecialchars() can even be preferable, depending on the situation.
Generally it all comes down to two simple rules: filter all input, escape all output. Now you need to understand what exactly constitutes input and output.
Output is easy, everything your application sends to the browser is output, so use htmlspecialchars() or any other escaping function every time you output data you didn't write yourself.
Input is any data not hardcoded in your PHP code: things coming from a form via POST, from a query string via GET, from cookies. All of those must be filtered in the most appropriate way depending on your needs. Even data coming from a database should be considered potentially dangerous; especially on a shared server, you never know if the database was compromised elsewhere in a way that could affect your app too.
There are different ways to filter data: white lists to allow only selected values, validation based on the expected input format, and so on. One thing I never suggest is trying to fix the data you get from users: have them play by your rules, and if you don't get what you expect, reject the request instead of trying to clean it up.
Special attention, if you deal with a database, must be paid to SQL injections: that kind of attack relies on you not properly constructing the query strings you send to the database, so that the attacker can forge them to execute malicious instructions. You should always use an escaping function such as mysql_real_escape_string() (part of the old mysql extension, which was removed in PHP 7) or, better, use prepared statements with the mysqli extension or PDO.
There's more to say on this topic, but these points should get you started.
HTH
EDIT: to clarify, by "filtering input" I mean decide what's good and what's bad, not modify input data in any way. As I said I'd never modify user data unless it's output to the browser.
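To make the white list idea concrete, here is a minimal sketch that accepts only a fixed set of values and rejects everything else without trying to repair it. The sort parameter and the allowed column names are assumptions for illustration.

```php
<?php
// Sketch only: the "sort" parameter and its allowed values are hypothetical.

/** Return the requested sort column if it is on the list, otherwise null (reject). */
function filterSortColumn($value): ?string
{
    $allowed = ['name', 'email', 'created_at'];

    // Strict comparison: only an exact string match gets through.
    return in_array($value, $allowed, true) ? $value : null;
}

var_dump(filterSortColumn('email'));                   // string(5) "email"
var_dump(filterSortColumn('email; DROP TABLE users')); // NULL
```

A white list like this is also useful for values that cannot be bound as parameters in a prepared statement, such as a column name in an ORDER BY clause.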

strip_tags is not the best thing to use really, it doesn't protect in all cases.
HTML Purifier (http://htmlpurifier.org/) is a really good option for processing incoming data. It still will not cater for all use cases, but it's definitely a good starting point.

I have to say that the tutorial you mentioned is a little misleading about security:
"It is important to note that you never want to directly work with the $_GET & $_POST values. Always send their value to a local variable, & work with it there. There are several security implications involved with the values when you directly access (or output) $_GET & $_POST."
This is nonsense. Copying a value to a local variable is no more safe than using the $_GET or $_POST variables directly.
In fact, there's nothing inherently unsafe about any data. What matters is what you do with it. There are perfectly legitimate reasons why you might have a $_POST variable that contains ; rm -rf /. This is fine for outputting on an HTML page or storing in a database, for example.
The only time it's unsafe is when you're using a command like system or exec. And that's the time you need to worry about what variables you're using. In this case, you'd probably want to use something like a whitelist, or at least run your values through escapeshellarg.
Similarly with sending queries to databases, sending HTML to browsers, and so on. Escape the data right before you send it somewhere else, using the appropriate escaping method for the destination.
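As an illustration, the same value can be prepared for three different destinations as sketched below. The grep command and the search.php URL are made-up examples, and the command is only built as a string, never executed.

```php
<?php
// Sketch only: the command and the URL are hypothetical; nothing is executed.

$value = '; rm -rf /'; // harmless as plain text

// Destination: an HTML page.
$html = '<p>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</p>';

// Destination: a shell command line (quoted as one single argument).
$command = 'grep -- ' . escapeshellarg($value) . ' notes.txt';

// Destination: a URL query string.
$url = 'search.php?q=' . rawurlencode($value);

echo $html, "\n", $command, "\n", $url, "\n";
```

Each destination has its own escaping function, and applying the wrong one (for example htmlspecialchars() on a shell argument) protects nothing.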

strip_tags removes every piece of HTML. More sophisticated solutions are based on whitelisting (i.e. allowing specific HTML tags). A good whitelisting library is HTML Purifier: http://htmlpurifier.org/
And of course, on the database side of things, use functions like mysql_real_escape_string or pg_escape_string.

Well, I'm probably wrong, but... in all the literature I've read, people say it's much better to use htmlspecialchars.
It's also rather necessary to cast input data (to int, for example, if you are sure it's a user ID).
And, ahead of time, for when you start using a database: use mysql_real_escape_string instead of mysql_escape_string to prevent SQL injections (some old books still say mysql_escape_string).
