Output or Input filtering?
I constantly see people writing "filter your inputs", "sanitize your inputs", "don't trust user data", but I only agree with the last one: I consider trusting any external data a bad idea, even if it comes from inside the system.
Input filtering:
This is the approach I see most often.
Take the form post data or any other external source of information and define some boundaries when saving it, for example making sure text is text, numbers are numbers, that sql is valid sql, that html is valid html and that it does not contain harmful markup, and then you save the "safe" data in the database.
But when fetching data you just use the raw data from the database.
In my personal opinion, the data is never really safe.
Although it sounds easy (just filter everything you get from forms and URLs), in reality it is much harder than that: data that is safe for one language might not be safe for another.
Output filtering:
When doing it this way I save the raw, unaltered data, whatever it might be, into the database with prepared statements, and then filter out the problematic code when accessing the data. This has its own advantages:
It adds a layer between the HTML and the server-side script, which I consider to be data access separation of sorts.
Data is now filtered depending on the context: for example, I can present the data from the database in an HTML document as escaped plain text, as HTML, or as anything else anywhere.
The drawbacks are that you must never forget to add the filtering, which is a little harder than with input filtering, and that it uses a bit more CPU when serving data.
This does not mean that you don't need validation checks. You still do; it's just that you don't save filtered data. You validate the data and give the user an error message if it is somehow invalid.
So instead of going with "filter your inputs" maybe it should be "validate your inputs, filter your outputs".
So should I go with "Input validation and filtering" or "Input validation and output filtering"?
There is no generic "filtering" for input and output.
Validate your input, escape your output. How you do this depends on context.
Validation is about making sure input falls within sensible ranges, like the length of strings, the numericality of dollar amounts or that a record being updated is owned by the user performing the update. This is about maintaining the logical consistency of your data and preventing people from doing things like zeroing the price of a product they are purchasing or deleting records they shouldn't have access to. It has nothing to do with "filtering" or escaping specific characters in your input.
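To make the three kinds of check above concrete, here is a minimal sketch in Python (the thread is about PHP, but the idea carries over unchanged). The function name, the field names and the limits are all made up for illustration. Note that nothing is stripped or escaped: the input is measured and either accepted or rejected.

```python
from decimal import Decimal, InvalidOperation

def validate_order_update(form, record_owner_id, current_user_id):
    """Return a list of error messages; an empty list means the input is valid.

    The field names ("name", "price") and the limits are hypothetical.
    """
    errors = []

    # Length check: the value is only measured, never modified.
    name = form.get("name", "")
    if not 1 <= len(name) <= 100:
        errors.append("Name must be between 1 and 100 characters.")

    # Numeric check: a dollar amount must parse and must be positive,
    # so nobody can zero out the price of what they are buying.
    try:
        price = Decimal(form.get("price", ""))
        if not price.is_finite() or price <= 0:
            errors.append("Price must be greater than zero.")
    except InvalidOperation:
        errors.append("Price must be a number.")

    # Ownership check: the record must belong to the user making the change.
    if record_owner_id != current_user_id:
        errors.append("You may not modify this record.")

    return errors
```

A name such as `<b>O'Brien</b>` passes this validation untouched, which is the point: special characters are a matter for escaping later, not for validation.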
Escaping is a matter of context, and only really makes sense when you're doing something with data that can be poisoned by injecting certain characters. Escape HTML characters in data you send to the browser. Escape SQL characters in data you send to the database. Escape quotes when you're writing data inside JavaScript <script> tags. Just be conscious of how the data you're dealing with is going to be interpreted by the system you're passing it to and escape accordingly.
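As a rough illustration of "escape according to where the data is going", the sketch below takes one raw string and prepares it for three different destinations. It is written in Python with standard-library calls only; the three helper names are invented for this example.

```python
import html
import json
from urllib.parse import quote

def for_html(value):
    # HTML text or attribute context: &, <, >, " and ' become entities.
    return html.escape(value, quote=True)

def for_url_component(value):
    # URL context: percent-encode everything that is not unreserved.
    return quote(value, safe="")

def for_script_block(value):
    # JavaScript string literal inside a <script> block. json.dumps adds the
    # quotes and backslash escapes; "<" is additionally written as \u003c so
    # the value can never contain a literal "</script>".
    return json.dumps(value).replace("<", "\\u003c")

raw = '</script><b>"Tom" & Jerry</b>'
html_version = for_html(raw)
url_version = for_url_component(raw)
js_version = for_script_block(raw)
```

In PHP the rough counterparts would be htmlspecialchars(), rawurlencode() and json_encode() with the JSON_HEX_TAG flag. The same raw value gets three different encodings, which is why it cannot be made "safe" once at input time.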
The best solution is to filter both. Doing just one makes it more likely that you miss a case, and can leave you open to other types of attacks.
If you only do input filtering, an attacker could find a way to bypass your inputs and cause a vulnerability. This could be someone with access to your database entering data manually, it could be an attacker uploading a file through FTP or some other channel that is not checked, or many other methods.
If you only do output filtering, you can leave yourself open to SQL injection and other server side attacks.
The best method is to filter both your inputs and outputs. It may cause more load, but greatly reduces the risk of an attacker finding a vulnerability.
Sounds like semantics to me. Either way the important thing to remember is to make sure bad data doesn't get in the system.
Doing output filtering instead of input filtering is asking for an SQL injection.
The Background
HTML form, e.g. for a user to submit their business details which will later appear on a legal document - so data needs to be precise.
Submits to a PHP script that validates all inputs.
If all inputs are valid, it sanitizes the data and writes it to a database using parameterised queries.
If any of the inputs are invalid, it re-displays the form. My feeling is that the user would expect this form to be populated with what they originally typed in with some feedback on what is wrong with their input. They can then amend their input and re-submit the form. This means the form needs to be populated with unsanitized data (this will be escaped before displaying it).
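A sketch of the re-display step described above, in Python rather than PHP: the raw values are kept exactly as typed and are escaped only at the moment they are written into the HTML. The field names (`company_name`, `vat_number`) and the function name are hypothetical.

```python
import html

def render_form(values, errors):
    """Build the form HTML, echoing back exactly what the user typed."""
    rows = []
    for field in ("company_name", "vat_number"):  # hypothetical field names
        raw = values.get(field, "")
        # Escape only at output time; `raw` itself is left untouched.
        rows.append(
            f'<input name="{field}" value="{html.escape(raw, quote=True)}">'
        )
        if field in errors:
            rows.append(f'<p class="error">{html.escape(errors[field])}</p>')
    return "\n".join(rows)
```

The browser decodes the entities again when it renders the field, so the user sees their original text and can correct it.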
All good so far.
The Problem
If the data is valid, it is written to a database. Best practice seems to be to sanitize the data before sending it to the database.
This means the data written to the database might not be exactly what the user typed in (e.g. if sanitization removes some "dangerous" characters).
This seems like a poor user experience to me.
I'm using PHP and the code is running within the WordPress framework. WP has its own sanitization functions and they recommend always sanitizing input before using it. They also suggest using PHP's sanitization features. But nothing seems to address the issue that sanitizing data before storing it might result in saved data being different from what the user entered.
The Question
What I'd like is a description of an approach that's been used in the real world and addresses this issue. Or some feedback from those of you more experienced than I am that this is not a problem in the real world, and that it's common practice just to sanitize data and store it without further concern or feedback to the user.
My thoughts about possible solutions
A more thorough pattern would be to consider unsanitary data as invalid and feed back to the user what is wrong with their input. But this seems impractical and would require fairly long sanitization functions to provide any specific and useful feedback to the user. It also renders existing WP/PHP sanitization functions somewhat irrelevant.
A practical compromise may be to compare sanitized data with raw data and then simply notify the user that something got cleaned up before it was saved... so they can at least check the saved data to make sure they're happy with it.
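The compare-and-notify idea could look roughly like this. It is a Python sketch, and `sanitize_text` is a deliberately crude stand-in invented for the example; it is not the WordPress or PHP sanitizer.

```python
import re

def sanitize_text(value):
    """Hypothetical sanitizer: drop anything that looks like a tag, then trim."""
    return re.sub(r"<[^>]*>", "", value).strip()

def changed_fields(raw_form):
    """Return {field: (raw, sanitized)} for every field the sanitizer altered."""
    changes = {}
    for field, raw in raw_form.items():
        clean = sanitize_text(raw)
        if clean != raw:
            changes[field] = (raw, clean)
    return changes

report = changed_fields({
    "name": "Acme Ltd",
    "note": "Price <10 and >5 units",
})
```

Here `report` flags the `note` field, because the crude sanitizer treats `<10 and >` as a tag and turns the text into `Price 5 units`. That is exactly the kind of silent damage to legitimate data that the question is worried about.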
Thanks for your help.
Conclusions
The answer I've accepted was helpful and led me to a solution for my particular use case, but I wanted to add a few points of my own.
Firstly, on re-reading the WP documentation I found that it's not recommending to validate AND then sanitize before writing to a database. It recommends to validate, but suggests sanitizing the input might be more convenient if the particular situation does not require strict validation. It also says use one or the other, not both. So I don't think the WP documentation is wrong on this, I just misread it.
Secondly, I didn't understand that parameterized queries are so effective against SQL injection. So I figured that sanitizing input before using it in a DB query was a sensible thing to do. But it seems it's not necessary.
And finally, I now realise that it's all about context... the issue is making data safe for a particular use. In that sense, it's not that one technique is only appropriate for input and another technique is only appropriate for output. I need to think about validating, sanitizing or escaping when doing anything with the data - e.g. write it to a DB, use it in a calculation, print it to screen, or inject it into a PDF document. And in all cases, I just need to think about how I make it safe for that particular use. Sanitizing "input" might be entirely appropriate - if it's quick and easy, makes the data safe for whatever I need to do and doesn't render the data inaccurate. Another example is the WordPress function esc_url_raw() which the manual says is specifically to be used when storing a URL in the database. So again, the idea that escaping is only appropriate for "output" is misleading.
I ended up validating the input before writing it to the database. I did not need to sanitize it as well. So if it's invalid, I tell the user. If it's valid, it gets written to the DB in its original form. And I escape it before displaying it back to the user.
"Best practice seems to be to sanitize the data before sending it to the database."
This is a common misconception. Sanitization should only be performed on data that is being output, to prevent XSS for example and even then only as a last resort. Exactly because it can irreversibly destroy the original data.
Validation is your first line of defense. Make sure that the data is properly formatted, and valid within its context - just that; no looking for special characters, don't be over-zealous. If it's not valid - reject it, don't try to salvage the "good" parts from it.
Then, when storing in the database, you merely need to use parameterized queries: values bound as parameters are never parsed as SQL, which shuts out SQL injection through those values. If you didn't mangle the data in a previous step, you're storing it in its original form.
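To show what "storing it in its original form" means in practice, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PDO and MySQL. The table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE businesses (id INTEGER PRIMARY KEY, name TEXT)")

# A classic injection attempt, used here as ordinary data.
hostile = "Robert'); DROP TABLE businesses;--"

# The ? placeholder binds the value as data, so it is never parsed as SQL.
conn.execute("INSERT INTO businesses (name) VALUES (?)", (hostile,))

# What comes back is byte-for-byte what went in.
stored = conn.execute("SELECT name FROM businesses").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0]
```

The table survives and `stored` equals `hostile`. With PDO the same pattern is `prepare()` followed by `execute()` with the values passed separately from the SQL text.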
And finally, when the data is being output, that is where you SHOULD escape special characters within the appropriate context, so that it is properly rendered; or sanitize it if you have no other choice (i.e. the context is unclear and therefore you can't do proper escaping).
It looks like you are worried about how the user feels, and that's good. There are a few things you can do.
Use the HTML form pattern attribute: surely no name needs characters like < > & $ " ..., so exclude them with a pattern, and use the CSS :invalid and :invalid:focus selectors to tell the user that something is wrong before they submit. It is easy and simple.
Then comes further PHP validation and WP sanitization.
You can use an intermediate step: after the 'wash', display the final version (no inputs) with two buttons, save or correct, and let the user decide. Most of us don't like repetitions such as "Are you sure? By clicking submit, do you mean submit?", but with content this important, users may want a last chance and may wish to see the final version (without inputs, checkboxes, etc.).
Now you put the accepted version into the DB (with a prepared statement).
Comparing the raw data with the washed data is not practical; honestly, it sucks. The users won't be coders, and they just won't be able to make sense of "we sanitised your answers, and now they are 345 characters shorter. Sorry for the inconvenience".
Don't worry too much
...there is a German last name 'Ei' with only 2 characters, so a pattern can't require more than 2.
I've been working with PHP for some time and I began asking myself if I'm developing good habits.
One of these is what I believe to be overuse of PHP sanitizing methods. For example, a user registers through a form, and I get the following POST variables:
$_POST['name'], $_POST['email'] and $_POST['captcha']. Now, what I usually do is obviously sanitize the data I am going to place into MySQL, but when comparing the captcha, I also sanitize it.
Therefore I believe I misunderstood PHP sanitizing. I'm curious: are there any other cases where you need to sanitize data, apart from placing it into MySQL (I know sanitizing is also needed to prevent XSS attacks)? Moreover, is my habit of sanitizing almost every variable coming from user input a bad one?
Whenever you store your data someplace, and if that data will be read/available to (unsuspecting) users, then you have to sanitize it. So something that could possibly change the user experience (not necessarily only the database) should be taken care of. Generally, all user input is considered unsafe, but you'll see in the next paragraph that some things might still be ignored, although I don't recommend it whatsoever.
Stuff that happens on the client only is sanitized just for a better UX (user experience, think about JS validation of the form - from the security standpoint it's useless because it's easily avoidable, but it helps non-malicious users to have a better interaction with the website) but basically, it can't do any harm because that data (good or bad) is lost as soon as the session is closed. You can always destroy a webpage for yourself (on your machine), but the problem is when someone can do it for others.
To answer your question more directly - never worry about overdoing it. It's always better to be safe than sorry, and the cost is usually not more than a couple of milliseconds.
The term you need to search for is FIEO. Filter Input, Escape Output.
You can easily confound yourself if you do not understand this basic principle.
Imagine PHP is the man in the middle, it receives with the left hand and doles out with the right.
A user fills in a date field on your form, so it should accept only digits and maybe dashes, e.g. nnnn-nn-nn. If you get something that does not match that, reject it.
That is an example of filtering.
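A minimal sketch of that date filter, written in Python for illustration (the function name is made up). It checks the shape first, then checks that the digits form a real calendar date, and rejects anything else instead of trying to repair it.

```python
import re
from datetime import date, datetime
from typing import Optional

# Shape check: exactly four digits, dash, two digits, dash, two digits.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def parse_date_field(text: str) -> Optional[date]:
    """Return a date for well-formed input, or None to signal rejection."""
    if not DATE_SHAPE.fullmatch(text):
        return None
    try:
        # Semantic check: rejects things like month 13 or 30 February.
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None
```

Anything that returns None goes back to the user with an error message; nothing is stripped or "fixed up" on their behalf.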
Next, PHP does something with it, let's say storing it in a MySQL database.
What MySQL needs is protection from SQL injection, so you use PDO or MySQLi prepared statements to make sure that EVEN IF your filter failed, an attack on your database cannot succeed. This is an example of escaping, in this case escaping for SQL storage.
Later, PHP gets the data from your DB and displays it on an HTML page. So you need to escape the data for the next medium, HTML (this is where XSS attacks get through if you don't).
In your head you have to divide each of the PHP 'protective' functions into one or other of these two families, Filtering or Escaping.
Freetext fields are of course more complex than filtering for a date, but never mind, stick to the principles and you will be OK.
Hoping this helps http://phpsec.org/projects/guide/
I've implemented input validation on all of my input data using php (as well as js on the front-end). I'm type casting where I can, validating stuff like emails against a regex, making sure dropdown values are only ones I'm expecting and also in many cases where I'm expecting only a string I have a regex that runs that only allows letters, numbers and spaces. Anything that doesn't meet these rules results in the form failing validation and no sql queries are run.
With that said, if my form passes validation I'm assuming that the data is safe to insert into my DB (which I'm doing via PDO), and it is then escaped on output.
So with that said why do I need input sanitization?
If you have very strict validation server-side, you don't need to sanitize. E.g. a string validated against /^[a-z0-9]{5,25}$/ will not need any sanitization (removing non-alphanumeric characters makes no sense, since they should not be able to pass anyway).
Just make sure you can validate all data, and if that's impossible (e.g. with HTML it tends to be a bit difficult), you can use escaping strategies or things like HTML Purifier.
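The point about strict validation making sanitization redundant can be checked directly. The sketch below is Python, with helper names invented for the example; it uses a full-string match of the same character class and length.

```python
import re

USERNAME = re.compile(r"[a-z0-9]{5,25}")

def is_valid_username(value):
    # fullmatch requires the whole string to match, with nothing before or after.
    return USERNAME.fullmatch(value) is not None

def strip_non_alnum(value):
    # The "sanitizer" that strict validation makes unnecessary.
    return re.sub(r"[^a-z0-9]", "", value)
```

For every string that passes `is_valid_username`, `strip_non_alnum` returns it unchanged, so sanitizing afterwards does nothing. One detail worth knowing: with `^...$`, both PHP's PCRE and Python let `$` match just before a trailing newline, so `"alice42\n"` would pass the pattern as written above unless the `D` modifier (PHP) or a full-string match (as here) is used.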
For a good overview on escaping strategies for XSS prevention:
see https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet
For an idea of different security threats:
https://www.owasp.org/index.php/PHP_Security_Cheat_Sheet
You need both. Validating input data is easily beaten at the client side, but it's useful for legitimate users who aren't trying to hack you.
Sanitize the data (all the data, whether it's input data or something straight from your DB that you think you should be able to trust) before putting it into your database.
Even if you 100% trust your validation and do it on the server side (where, in theory, people shouldn't be able to mess with the data), it's still worth using some form of sanitizing because it's a good habit to get into.
Hello friends
I have developed a form which allows users to store their data. When I store the data, what precautions must I take so that no wrong values are inserted and the form cannot be hacked?
What you're asking about is called input validation, and there's a lot of information about it out there.
There are primarily two parts:
Making sure the user put in something useful.
Making sure the user didn't put in something harmful.
The former is most often done via JavaScript on the client side (for a generally smoother user experience and fewer postbacks). It should be re-done on the server side as well just to make sure, since you should never trust user input. Basically it involves things like regular expressions to check the format of an email address, enums to check the value of a drop down list, etc.
The latter must be done server side because you should never trust user input. It involves escaping strings against SQL injection attacks, validating field length against buffer overflow attacks (less common these days), etc.
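A server-side sketch of the checks mentioned above (email format via a regular expression, drop-down value via an enum, field length), in Python for illustration. The `Country` values, the field names and the 500-character limit are invented, and the email pattern is only a loose shape check, not a full address validator.

```python
import re
from enum import Enum

class Country(Enum):
    # Hypothetical drop-down options: only these values are acceptable.
    DE = "de"
    FR = "fr"
    US = "us"

# Loose shape check: something@something.something, no spaces.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def check_signup(form):
    """Return {field: message} for every field that fails validation."""
    errors = {}
    if not EMAIL_SHAPE.fullmatch(form.get("email", "")):
        errors["email"] = "Please enter a valid email address."
    try:
        Country(form.get("country", ""))
    except ValueError:
        errors["country"] = "Please choose a country from the list."
    if len(form.get("comment", "")) > 500:
        errors["comment"] = "Comment is limited to 500 characters."
    return errors
```

These checks run on the server whatever the JavaScript on the page did, because a request can be sent without ever loading the form.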
Firstly, you need to understand two means of security:
Sanitation
Validation
Sanitation is cleaning the data before you validate it, so that needless validation failures are removed first.
Sanitation consists of removing characters such as non-visible characters (spaces, tabs, newlines, ...), and it should be done across the board.
After validating your data, for example with if(strlen($_GET['key']) > 0), you will insert the data into your database, but the way of doing this varies depending on the database type.
PHP offers functions to escape data, such as mysql_real_escape_string().
This method is referred to as database escaping.
You need to validate your input. You can do this with JavaScript functions that check the input before the form is submitted, or you can call PHP functions to check the values the form submits before they are stored in a database. If you are using PHP, you can opt to learn MVC frameworks such as CodeIgniter or CakePHP, which make this process a whole lot easier and friendlier for you as a developer. Such frameworks normally have libraries with validation code, so you just need to use them and not write your own.
I am very confused over something and was wondering if someone could explain.
In PHP I validate user input, so htmlentities and mysql_real_escape_string are used before inserting into the database. I don't do this on everything, as I prefer to use regular expressions when I can, although I find them hard to work with. Obviously I will use mysql_real_escape_string, as the data is going into the database, but I am not sure whether I should use htmlentities() only when getting data from the database and displaying it on a webpage. Doing so beforehand alters the data a person entered, so it is no longer in its original form, which may cause problems if I want to use that data for something else later on.
So for example, I have a guestbook with 3 fields: name, subject and message. The fields can contain anything, including malicious code in JS tags. What confuses me is this: say I am a malicious person and I submit the form with script tags and some malicious JS code; now I have malicious, useless data in my database. Using htmlentities when outputting the malicious code to the webpage (guestbook) means it is not a problem, because htmlentities has converted it to its safe equivalent, but at the same time I have useless malicious code in the database that I would rather not have.
So after saying all this, my question is: should I accept the fact that some data in the database may be malicious, useless data, and that as long as I use htmlentities on output everything will be OK, or should I be doing something else as well?
I have read so many books that talk about filtering data on receiving it and escaping it on output so that the original form is kept, but they only ever give examples like ensuring a field is an int using functions already built into PHP. I have never found anything about something like a guestbook, where you want users to type anything they want. How would you filter such data, apart from using mysql_real_escape_string() to ensure it does not break the DB query?
Could someone please finally clear up this confusion for me and tell me what I should be doing and what is best practice?
Thanks to anyone who can explain.
Cheers!
This is a long question, but I think what you're actually asking boils down to:
"Should I escape HTML before inserting it into my database, or when I go to display it?"
The generally accepted answer to this question is that you should escape the HTML (via htmlspecialchars) when you go to display it to the user, and not before putting it into the database.
The reason is this: a database stores data. What you are putting into it is what the user typed. When you call mysql_real_escape_string, it does not alter what is inserted into the database; it merely avoids interpreting the user's input as SQL statements. htmlspecialchars does the same thing for HTML; when you print the user's input, it will avoid having it interpreted as HTML. If you were to call htmlspecialchars before the insert, you would no longer be storing a faithful copy of what the user typed.
You should always strive to have the maximum-fidelity representation you can get. Since storing the "malicious" code in your database does no harm (in fact, it saves you some space, since escaped HTML is longer than unescaped!), and you might in the future want that HTML (what if you use an XML parser on user comments, or some day let trusted users have a subset of HTML in their comments, or some such?), why not let it be?
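The fidelity argument is easy to see with a short experiment. This is a Python sketch in which html.escape plays the role of htmlspecialchars, and the sample comment text is made up.

```python
import html

typed = "if a < b && b > c"

# Escaping before storage: the database would hold this longer, altered text.
escaped_once = html.escape(typed)

# A template that (correctly) escapes on output would then escape it again,
# and the visitor would see literal "&lt;" on the page instead of "<".
escaped_twice = html.escape(escaped_once)

# Storing the raw text and escaping once at output time round-trips cleanly.
round_trip = html.unescape(html.escape(typed))
```

Besides being longer, the pre-escaped copy is only correct for one destination (HTML). A plain-text email, a JSON API or a PDF built from the same row would all show the entities verbatim.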
You also ask a bit about other types of input validation (integer constraints, etc). Your database schema should enforce these, and they can also be checked at the application layer (preferably on input via JS and then again server side).
On another note, the best way to do database escaping with PHP is probably to use PDO, rather than calling mysql_real_escape_string directly. PDO has more advanced functionality, including type checking.
mysql_real_escape_string() is all you need for the database operations. It'll ensure that a malicious user can't embed something into data that'll "break" your queries.
htmlentities() and htmlspecialchars() come into play when you're working with sending stuff to the client/browser. If you want to clean up potentially hostile HTML, you'd be better off using HTMLPurifier, which will strip the data to the bedrock and hose it down with bleach and rebuild it properly.
There's no reason to worry about having malicious JavaScript code in the database if you're escaping the HTML when it comes out. Just make sure you always do escape anything that comes out of the DB.