input sanitization VS validation

input sanitization VS validation - php

I've implemented input validation on all of my input data using php (as well as js on the front-end). I'm type casting where I can, validating stuff like emails against a regex, making sure dropdown values are only ones I'm expecting and also in many cases where I'm expecting only a string I have a regex that runs that only allows letters, numbers and spaces. Anything that doesn't meet these rules results in the form failing validation and no sql queries are run.
With that said, if my form passes validation I'm assuming that it's safe to insert into my db (which I'm doing via PDO) and that it will then be escaped on output.
So with that said why do I need input sanitization?

If you have very strict validation server-side, you don't need to sanitize. E.g. validating a string against /^[a-z0-9]{5,25}$/ will not need any sanitization (removing non-alphanumeric characters would not make any sense, since they should not be able to pass anyway).
Just make sure you can validate all data, and if that's impossible (e.g. with HTML it tends to be a bit difficult), you can use escaping strategies or things like HTML Purifier.
For a good overview on escaping strategies for XSS prevention:
see https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet
For an idea of different security threats:
https://www.owasp.org/index.php/PHP_Security_Cheat_Sheet
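As a rough sketch of what such strict server-side validation can look like in PHP: the function name is made up for illustration, and the pattern is the one quoted above. One detail worth knowing is that in PCRE a plain $ also matches just before a trailing newline, so this sketch adds the D modifier to close that gap.

```php
<?php
/**
 * Hypothetical helper: accept only 5-25 lowercase letters or digits.
 * The D modifier makes $ match only at the very end of the string,
 * so "abcde\n" is rejected as well.
 */
function isValidUsername(string $input): bool
{
    return preg_match('/^[a-z0-9]{5,25}$/D', $input) === 1;
}

var_dump(isValidUsername('alice42'));          // bool(true)
var_dump(isValidUsername('bob'));              // bool(false): too short
var_dump(isValidUsername('<script>alert(1)')); // bool(false): characters outside the whitelist
```

Anything that fails the check is rejected outright, so there is nothing left over to sanitize.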

You need both. Validating input data is easily beaten at the client side, but it's useful for legitimate users who aren't trying to hack you.
Sanitize the data (all the data, whether it's input data or something straight from your DB that you think you should be able to trust) before putting it into your database.
Even if you 100% trust your validation and do it on the server side (where, in theory, people shouldn't be able to mess with the data), it's still worth using some form of sanitizing because it's a good habit to get into.

Related

Sanitizing data before storing it might mean stored data is different to what the user entered - is that common practice?

The Background
HTML form, e.g. for a user to submit their business details which will later appear on a legal document - so data needs to be precise.
Submits to a PHP script that validates all inputs.
If all inputs are valid, it sanitizes the data and writes it to a database using parameterised queries.
If any of the inputs are invalid, it re-displays the form. My feeling is that the user would expect this form to be populated with what they originally typed in with some feedback on what is wrong with their input. They can then amend their input and re-submit the form. This means the form needs to be populated with unsanitized data (this will be escaped before displaying it).
All good so far.
The Problem
If the data is valid, it is written to a database. Best practice seems to be to sanitize the data before sending it to the database.
This means the data written to the database might not be exactly what the user typed in (e.g. if sanitization removes some "dangerous" characters).
This seems like a poor user experience to me.
I'm using PHP and the code is running within the WordPress framework. WP has its own sanitization functions and they recommend always sanitizing input before using it. They also suggest using PHP's sanitization features too. But nothing seems to address the issue that sanitizing data before storing it might result in saved data being different from what the user entered.
The Question
What I'd like is a description of an approach that's been used in the real world that addresses this issue. Or some feedback from those of you more experienced than I am that this is not a problem in the real world, and that it's common practice just to sanitize data and store it without further concern or feedback to the user.
My thoughts about possible solutions
A more thorough pattern would be to consider unsanitary data as invalid and feedback to the user what is wrong with their input. But this seems impractical and would require fairly long sanitization functions to provide any specific and useful feedback to the user. It also renders existing WP/PHP sanitization functions somewhat irrelevant.
A practical compromise may be to compare sanitized data with raw data and then simply notify the user that something got cleaned up before it was saved... so they can at least check the saved data to make sure they're happy with it.
Thanks for your help.
Conclusions
The answer I've accepted was helpful and led me to a solution to my particular use case, but I wanted to add a few points of my own.
Firstly, on re-reading the WP documentation I found that it's not recommending to validate AND then sanitize before writing to a database. It recommends to validate, but suggests sanitizing the input might be more convenient if the particular situation does not require strict validation. It also says use one or the other, not both. So I don't think the WP documentation is wrong on this, I just misread it.
Secondly, I didn't understand that parameterized queries are so effective against SQL injection. So I figured that sanitizing input before using it in a DB query was a sensible thing to do. But it seems it's not necessary.
And finally, I now realise that it's all about context... the issue is making data safe for a particular use. In that sense, it's not that one technique is only appropriate for input and another technique is only appropriate for output. I need to think about validating, sanitizing or escaping when doing anything with the data - e.g. write it to a DB, use it in a calculation, print it to screen, or inject it into a PDF document. And in all cases, I just need to think about how I make it safe for that particular use. Sanitizing "input" might be entirely appropriate - if it's quick and easy, makes the data safe for whatever I need to do and doesn't render the data inaccurate. Another example is the WordPress function esc_url_raw() which the manual says is specifically to be used when storing a URL in the database. So again, the idea that escaping is only appropriate for "output" is misleading.
I ended up validating the input before writing it to the database. I did not need to sanitize it as well. So if it's invalid, I tell the user. If it's valid, it gets written to the DB in its original form. And I escape it before displaying it back to the user.

Best practice seems to be to sanitize the data before sending it to the database.
This is a common misconception. Sanitization should only be performed on data that is being output, to prevent XSS for example and even then only as a last resort. Exactly because it can irreversibly destroy the original data.
Validation is your first line of defense. Make sure that the data is properly formatted, and valid within its context - just that; no looking for special characters, don't be over-zealous. If it's not valid - reject it, don't try to salvage the "good" parts from it.
Then, when storing in database, you merely need to use parameterized queries - that is 100% effective against SQL injections. If you didn't mangle the data in a previous step, you're storing it in its original form.
And finally, when the data is being output, that is where you SHOULD escape special characters within the appropriate context, so that it is properly rendered; or sanitize it if you have no other choice (i.e. the context is unclear and therefore you can't do proper escaping).
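The store-raw, escape-on-output flow described above can be sketched roughly as follows. This is only an illustration: the table, the column, the function names and the in-memory SQLite database are assumptions chosen to keep the example self-contained, and validation is assumed to have happened already.

```php
<?php
// Sketch only: table name, column name and the in-memory SQLite database
// are assumptions chosen so the example is self-contained.

function storeName(PDO $pdo, string $name): void
{
    // The value travels separately from the SQL text, so quotes in
    // $name cannot change the structure of the query.
    $stmt = $pdo->prepare('INSERT INTO businesses (name) VALUES (:name)');
    $stmt->execute([':name' => $name]);
}

function renderName(string $name): string
{
    // Escape for an HTML body context, at output time only.
    return htmlspecialchars($name, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

function demoRoundTrip(string $name): array
{
    $pdo = new PDO('sqlite::memory:');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec('CREATE TABLE businesses (name TEXT NOT NULL)');

    storeName($pdo, $name);
    $stored = $pdo->query('SELECT name FROM businesses')->fetchColumn();

    // 'stored' is the untouched original; 'html' is the escaped view of it.
    return ['stored' => $stored, 'html' => renderName($stored)];
}
```

With the pdo_sqlite driver available, demoRoundTrip("O'Brien & Sons <Ltd>") should return the name unchanged under 'stored' and O&#039;Brien &amp; Sons &lt;Ltd&gt; under 'html', so the database keeps exactly what the user typed.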

It looks like you are worried about how the user feels, and that's good. There are a few things you can do.
Use the HTML form pattern attribute. Surely no name needs characters like < > & $ ", so exclude them with a pattern, and use CSS :invalid and :invalid:focus to inform the user before submitting if something is wrong. It is very easy and simple.
Then comes further PHP validation and WP sanitization.
You can use an intermediate state: after the 'wash', display the final version (no inputs) with two buttons, save or correct, and let the user decide. Most of us don't like these repetitions ("Are you sure? By clicking submit do you mean submit?"), but with content this important users may want a last chance to see the final version (without inputs, checkboxes, etc.).
Now you put the accepted version into the db (with a prepared statement).
Comparing the raw data with the washed data is not practical; honestly, it sucks. The users won't be coders, and they just won't be able to make sense of "we sanitised your answers, and now they are 345 characters shorter. Sorry for the inconvenience".
Don't worry too much.
...there is a German last name 'Ei' with only 2 characters, so the pattern can't require more than 2.

SQL prevention of XSS

Hey guys, so I've got a question: is there something I could use when inserting data into SQL to prevent XSS, instead of when reading it?
For example, I have quite a bit of output from my SQL that is user generated. Is it possible to just make that safe on entering SQL, or do I have to make it safe when it leaves SQL?
TL;DR: can I use something like htmlspecialchars when inserting data into SQL to prevent XSS, and will that be any sort of good protection?

I think several things are mixed up in the question.
Preventing XSS with input validation
In general you can't prevent XSS with input validation, except in very special cases where you can validate input against something very strict, like numbers only.
Consider this html page (let's imagine <?= is used to insert data into your html in your server-side language because you hinted at PHP, could of course differ by language used):
<script>
var myVar = <?= var1 ?>;
</script>
In this case, var1 on the server doesn't need to have any special character, only letters are enough to inject javascript. Whether that can be useful for an attacker depends on several things, but technically, this would be vulnerable to XSS with almost any input validation. Of course such assignment may not currently be in your Javascript, but how will you ensure that there never will be?
Another example is obviously DOM XSS, where input does not ever get to the server, but that's a different story.
Preventing XSS is an output encoding thing. Input validation may help in some cases, but will not provide sufficient protection in most cases.
Storing encoded values
It is generally not a good idea to store values html-encoded in your database. On the one hand, it makes searching, ordering, any kind of processing much more cumbersome. On the other hand, it violates single responsibility and separation of concerns. Encoding is a view-level thing, your backend database has nothing to do with how you will want to present that data. It's even more emphasized when you consider different encodings. HTML encoding is only ok if you want to write the data into an HTML context. If it's javascript (in a script tag, or in an on* attribute like onclick, or several other places), html encoding is not sufficient, let alone where you have more special outputs. Your database doesn't need to know, where the data will be used, it's an output thing, and as such, it should be handled by views.
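To make the point about contexts concrete, here is a small sketch of encoding the same raw value differently depending on where it is written. The helper names are made up for this example, and JSON_THROW_ON_ERROR assumes PHP 7.3 or later.

```php
<?php
// Hypothetical helpers; the names are made up for this sketch.

function escapeForHtml(string $value): string
{
    // Suitable for HTML element content and quoted attribute values.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

function encodeForScript($value): string
{
    // Produces a JavaScript literal. The JSON_HEX_* flags turn <, >, &, '
    // and " inside the value into \uXXXX escapes, so the value cannot
    // close the surrounding script tag.
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

// The raw value is what would be stored in the database, unencoded.
$raw = '</script><script>alert(1)</script>';

echo '<p>' . escapeForHtml($raw) . "</p>\n";
echo '<script>var myVar = ' . encodeForScript($raw) . ";</script>\n";
```

Note that a letters-only value such as alert comes out of encodeForScript() as the quoted string "alert" rather than as a bare identifier, which is the case the script example earlier in this answer is about.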

You should test the input against a whitelist of characters using a regex, accepting only something like [a-zA-Z0-9], for example. You'll have a big headache if you try it the other way around with a blacklist, because there is a huge number of ways to exploit input and catching them all is a big problem.
Also, be aware of SQL injection. You can use sqlmap on Linux to test whether your website is vulnerable.

Sanitizing PHP Variables, am I overusing it?

I've been working with PHP for some time and I began asking myself if I'm developing good habits.
One of these is what I believe amounts to overusing PHP sanitizing methods. For example, a user registers through a form, and I get the following post variables:
$_POST['name'], $_POST['email'] and $_POST['captcha']. Now, what I usually do is obviously sanitize the data I am going to place into MySQL, but when comparing the captcha, I also sanitize it.
Therefore I believe I have misunderstood PHP sanitizing. I'm curious: are there any other cases when you need to sanitize data, apart from when placing something in MySQL (note I know sanitizing is also needed to prevent XSS attacks)? And moreover, is my habit of sanitizing almost every variable coming from user input a bad one?

Whenever you store your data someplace, and if that data will be read/available to (unsuspecting) users, then you have to sanitize it. So something that could possibly change the user experience (not necessarily only the database) should be taken care of. Generally, all user input is considered unsafe, but you'll see in the next paragraph that some things might still be ignored, although I don't recommend it whatsoever.
Stuff that happens on the client only is sanitized just for a better UX (user experience, think about JS validation of the form - from the security standpoint it's useless because it's easily avoidable, but it helps non-malicious users to have a better interaction with the website) but basically, it can't do any harm because that data (good or bad) is lost as soon as the session is closed. You can always destroy a webpage for yourself (on your machine), but the problem is when someone can do it for others.
To answer your question more directly - never worry about overdoing it. It's always better to be safe than sorry, and the cost is usually not more than a couple of milliseconds.

The term you need to search for is FIEO. Filter Input, Escape Output.
You can easily confound yourself if you do not understand this basic principle.
Imagine PHP is the man in the middle, it receives with the left hand and doles out with the right.
A user uses your form and fills in a date field, so it should only accept digits and maybe dashes, e.g. nnnn-nn-nn. If you get something which does not match that, then reject it.
That is an example of filtering.
Next, PHP does something with it, let's say storing it in a MySQL database.
What MySQL needs is to be protected from SQL injection, so you use PDO's or MySQLi's prepared statements to make sure that EVEN IF your filter failed, you do not permit an attack on your database. This is an example of escaping, in this case escaping for SQL storage.
Later, PHP gets the data from your db and displays it on an HTML page. So you need to escape the data for the next medium, HTML (this is where XSS attacks get through if you don't).
In your head you have to divide each of the PHP 'protective' functions into one or other of these two families, Filtering or Escaping.
Freetext fields are of course more complex than filtering for a date, but never mind, stick to the principles and you will be OK.
Hoping this helps http://phpsec.org/projects/guide/
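The date example above could be sketched like this, assuming the intended format is a four-digit year, i.e. nnnn-nn-nn. The function names are illustrative and not from any library; the SQL step is left out because a prepared statement handles it.

```php
<?php
// Sketch of "filter input, escape output" for a date field.
// Function names are illustrative, not from any library.

function filterDate(string $input): ?string
{
    // Filter: exactly nnnn-nn-nn, with nothing before or after.
    // \A and \z anchor to the true start and end of the string.
    if (preg_match('/\A(\d{4})-(\d{2})-(\d{2})\z/', $input, $m) !== 1) {
        return null;
    }
    // The shape is right; now check that it is a real calendar date.
    if (!checkdate((int) $m[2], (int) $m[3], (int) $m[1])) {
        return null;
    }
    return $input;
}

function escapeHtml(string $value): string
{
    // Escape for the next medium, here an HTML page.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}
```

A null result from filterDate() means "reject and tell the user"; nothing is repaired or stripped along the way.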

Storing form data in database

Hello friends
I have developed a form which allows users to store their data. When I store the data, what care must I take so that no wrong values are inserted and so that it cannot be hacked?

What you're asking about is called input validation, and there's a lot of information about it out there.
There are primarily two parts:
Making sure the user put in something useful.
Making sure the user didn't put in something harmful.
The former is most often done via JavaScript on the client side (for a generally smoother user experience and fewer postbacks). It should be re-done on the server side as well just to make sure, since you should never trust user input. Basically it involves things like regular expressions to check the format of an email address, enums to check the value of a drop down list, etc.
The latter must be done server side because you should never trust user input. It involves escaping strings against SQL injection attacks, validating field length against buffer overflow attacks (less common these days), etc.
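A minimal server-side sketch of the checks mentioned above (email format, drop-down value, field length) might look like this. The form fields, the allowed plan values and the length limit are all made up for illustration.

```php
<?php
// Hypothetical form handler; field names and allowed values are made up.

function validateSignup(array $post): array
{
    $errors = [];

    // Format check: PHP's built-in email filter instead of a hand-written regex.
    $email = $post['email'] ?? '';
    if (!is_string($email) || filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        $errors['email'] = 'Please enter a valid email address.';
    }

    // Enum check: the drop-down value must be one of the options we rendered.
    $allowedPlans = ['free', 'pro', 'team'];
    if (!in_array($post['plan'] ?? null, $allowedPlans, true)) {
        $errors['plan'] = 'Please choose one of the listed plans.';
    }

    // Length check (in bytes here; multibyte text would need mb_strlen).
    $name = $post['name'] ?? '';
    if (!is_string($name) || $name === '' || strlen($name) > 100) {
        $errors['name'] = 'Name must be between 1 and 100 characters.';
    }

    return $errors;
}
```

An empty array means the submission passed; otherwise the keys name the fields to highlight when the form is shown again.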

Firstly, you need to understand two means of security:
Sanitization
Validation
Sanitization is cleaning the data before you validate it, so that validation does not fail on flaws that did not need to be there.
Sanitization consists of removing characters such as non-visible characters (spaces, tabs, new-lines, ...), and it should be done across the board.
After validating your data, e.g. with if(strlen($_GET['key']) > 0), you will be inserting the data into your database, but the way of doing this varies depending on the database type.
PHP offers functions to escape data, such as mysql_real_escape_string() (part of the old mysql extension, which was removed in PHP 7; mysqli_real_escape_string() or prepared statements are the current equivalents).
This method is referred to as database escaping.

You need to validate your input, you can do this by Javascript functions which check the input before the form is submitted or you can also call PHP functions to check the values that the form submits before they are stored to a database. If you are using PHP you can opt to learn MVC frameworks such as CodeIgniter or CakePHP which make this process a whole lot easier and more friendly for you as a developer. Such frameworks normally have libraries with code for validations so you just need to use them and not write your own.

Output or Input filtering?

I constantly see people writing "filter your inputs", "sanitize your inputs", "don't trust user data", but I only agree with the last one: I consider trusting any external data a bad idea, even if it is internal relative to the system.
Input filtering:
The most common that I see.
Take the form post data or any other external source of information and define some boundaries when saving it, for example making sure text is text, numbers are numbers, that sql is valid sql, that html is valid html and that it does not contain harmful markup, and then you save the "safe" data in the database.
But when fetching data you just use the raw data from the database.
In my personal opinion, the data is never really safe.
Although it sounds easy (just filter everything you get from forms and URLs), in reality it is much harder than that: data might be safe for one language but not another.
Output filtering:
When doing it this way I save the raw unaltered data, whatever it might be, with prepared statements into the database and then filter out the problematic code when accessing the data. This has its own advantages:
This adds a layer between html and the server side script.
which I consider to be data access separation of sorts.
Now data is filtered depending on the context, for example I can have the data from the database presented in a html document as plain-escaped-text, or as html or as anything anywhere.
The drawbacks here are that you must never forget to add the filtering, which is a little bit harder than with input filtering, and that it uses a bit more CPU when providing data.
This does not mean that you don't need to do validation checks. You still do; it's just that you don't save the filtered data. You validate it and provide the user with an error message if the data is somehow invalid.
So instead of going with "filter your inputs" maybe it should be "validate your inputs, filter your outputs".
So should I go with "Input validation and filtering" or "Input validation and output filtering"?

There is no generic "filtering" for input and output.
Validate your input, escape your output. How you do this depends on context.
Validation is about making sure input falls within sensible ranges, like the length of strings, the numericality of dollar amounts or that a record being updated is owned by the user performing the update. This is about maintaining the logical consistency of your data and preventing people from doing things like zeroing the price of a product they are purchasing or deleting records they shouldn't have access to. It has nothing to do with "filtering" or escaping specific characters in your input.
Escaping is a matter of context, and only really makes sense when you're doing something with data that can be poisoned by injecting certain characters. Escape HTML characters in data you send to the browser. Escape SQL characters in data you send to the database. Escape quotes when you're writing data inside JavaScript <script> tags. Just be conscious of how the data you're dealing with is going to be interpreted by the system you're passing it to and escape accordingly.
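To illustrate validation as a consistency check rather than character filtering, here is a hedged sketch of the two examples mentioned above, numeric prices and record ownership. The record layout, the function name and the price bounds are assumptions made up for this example.

```php
<?php
// Sketch: validation as a consistency check, not character filtering.
// The record layout and the price bounds are assumptions for illustration.

function validatePriceUpdate(array $record, int $currentUserId, string $priceInput): array
{
    $errors = [];

    // Ownership: only the owner may update this record.
    if (($record['owner_id'] ?? null) !== $currentUserId) {
        $errors[] = 'You are not allowed to edit this record.';
    }

    // Numericality and range: reject non-numbers and zero, negative
    // or absurdly large prices.
    $price = filter_var($priceInput, FILTER_VALIDATE_FLOAT);
    if ($price === false || $price <= 0 || $price > 1000000) {
        $errors[] = 'Price must be a number greater than 0.';
    }

    return $errors;
}
```

None of these checks look at individual characters; escaping for SQL, HTML or JavaScript still happens separately, at the point where the data is handed to that system.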

The best solution is to filter both. Doing just one makes it more likely that you miss a case, and can leave you open to other types of attacks.
If you only do input filtering, an attacker could find a way to bypass your inputs and cause a vulnerability. This could be someone with access to your database entering data manually, it could be an attacker uploading a file through FTP or some other channel that is not checked, or many other methods.
If you only do output filtering, you can leave yourself open to SQL injection and other server side attacks.
The best method is to filter both your inputs and outputs. It may cause more load, but greatly reduces the risk of an attacker finding a vulnerability.

Sounds like semantics to me. Either way the important thing to remember is to make sure bad data doesn't get in the system.
Doing output filtering instead of input filtering is asking for an SQL injection.
