Validating user input?

Validating user input? - php

I am very confused over something and was wondering if someone could explain.
In PHP i validate user input so htmlentitiies, mysql_real_escape_string is used before inserting into database, not on everything as i do prefer to use regular expressions when i can although i find them hard to work with. Now obviously i will use mysql_real_escape_string as the data is going into the database but not sure should i be using htmlentities() only when getting data from database and displaying it on a webpage as doing so before hand is altering the data entered by a person which is not keeping it's original form which may cause problems if i want to use that data later on for use for something else.
So for example, i have a guestbook with 3 fields name, subject and message. Now obviously the fields can contain anything like malicious code in js tags basically anything, now what confuses me is let say i am a malicious person and i decided to use js tags and some malicous js code and submit the form, now basically i have malicious useless data in my database. Now by using htmlentities when outputting the malicious code to the webpage (guestbook) that is not a problem because htmlentities has converted it to it's safe equivalent but then at the same time i have useless malicious code in the database that i would rather not have.
So after saying all this my question is should i accept the fact that some data in the database maybe malicious, useless data and as long as i use htmlentities on output everything will be ok or should i be doing something else aswell?.
I read so many books saying about filtering data on receiving it and escaping it on outputting it so the original form is kept but they only ever give examples like ensuring a field is only an int using functions already built into php etc but i have never found anything in regards ensuring something like a guestbook where you want users to type anything they want but also how you would filter such data apart from mysql_real_escape_string() to ensure it does not break the DB query?
Could someone please finally close this confusion for me and tell me what i should be doing and what is best practice?
Thanks to anyone who can explain.
Cheers!

This is a long question, but I think what you're actually asking boils down to:
"Should I escape HTML before inserting it into my database, or when I go to display it?"
The generally accepted answer to this question is that you should escape the HTML (via htmlspecialchars) when you go to display it to the user, and not before putting it into the database.
The reason is this: a database stores data. What you are putting into it is what the user typed. When you call mysql_real_escape_string, it does not alter what is inserted into the database; it merely avoids interpreting the user's input as SQL statements. htmlspecialchars does the same thing for HTML; when you print the user's input, it will avoid having it interpreted as HTML. If you were to call htmlspecialchars before the insert, you are no longer being faithful.
You should always strive to have the maximum-fidelity representation you can get. Since storing the "malicious" code in your database does no harm (in fact, it saves you some space, since escaped HTML is longer than unescaped!), and you might in the future want that HTML (what if you use an XML parser on user comments, or some day let trusted users have a subset of HTML in their comments, or some such?), why not let it be?
You also ask a bit about other types of input validation (integer constraints, etc). Your database schema should enforce these, and they can also be checked at the application layer (preferably on input via JS and then again server side).
On another note, the best way to do database escaping with PHP is probably to use PDO, rather than calling mysql_real_escape_string directly. PDO has more advanced functionality, including type checking.

mysql_real_escape_string() is all you need for the database operations. It'll ensure that a malicious user can't embed something into data that'll "break" your queries.
htmlentities() and htmlspecialchars() come into play when you're working with sending stuff to the client/browser. If you want to clean up potentially hostile HTML, you'd be better off using HTMLPurifier, which will strip the data to the bedrock and hose it down with bleach and rebuild it properly.

There's no reason to worry about having malicious JavaScript code in the database if you're escaping the HTML when it comes out. Just make sure you always do escape anything that comes out of the DB.

Related

Sanitizing data before storing it might mean stored data is different to what the user entered - is that common practice?

The Background
HTML form, e.g. for a user to submit their business details which will later appear on a legal document - so data needs to be precise.
Submits to a PHP script that validates all inputs.
If all inputs are valid, it sanitizes the data and writes it to a database using parameterised queries.
If any of the inputs are invalid, it re-displays the form. My feeling is that the user would expect this form to be populated with what they originally typed in with some feedback on what is wrong with their input. They can then amend their input and re-submit the form. This means the form needs to be populated with unsanitized data (this will be escaped before displaying it).
All good so far.
The Problem
If the data is valid, it is written to a database. Best practice seems to be to sanitize the data before sending it to the database.
This means the data written to the database might not be exactly what the user typed in (e.g. if sanitization removes some "dangerous" characters).
This seems like a poor user experience to me.
I'm using PHP and the code is running within the WordPress framework. WP has its own sanitization functions and they recommend always sanitizing input before using it. They also suggest using PHP's santization features too. But nothing seems to address the issue that sanitizing data before storing it might result in saved data being different from what the user entered.
The Question
What I'd like is a description of an approach that's been used in the real world that addresses this issue? Or some feedback from those of you more experience than I am, that this is not a problem in the real world and it's common practice just to sanitize data and store it without further concern or feedback to the user.
My thoughts about possible solutions
A more thorough pattern would be to consider unsanitary data as invalid and feedback to the user what is wrong with their input. But this seems impractical and would require fairly long sanitization functions to provide any specific and useful feedback to the user. It also renders existing WP/PHP sanitization functions somewhat irrelevant.
A practical compromise may be to compare sanitized data with raw data and then simply notify the user that something got cleaned up before it was saved... so they can at least check the saved data to make sure they're happy with it.
Thanks for your help.
Conclusions
The answer I've accepted was helpful and lead me to a solution to my particular use case, but I wanted to add a few points of my own.
Firstly, on re-reading the WP documentation I found that it's not recommending to validate AND then sanitize before writing to a database. It recommends to validate, but suggests sanitizing the input might be more convenient if the particular situation does not require strict validation. It also says use one or the other, not both. So I don't think the WP documentation is wrong on this, I just misread it.
Secondly, I didn't understand that parameterized queries are so effective against SQL injection. So I figured that sanitizing input before using it in a DB query was a sensible thing to do. But it seems it's not necessary.
And finally, I now realise that it's all about context... the issue is making data safe for a particular use. In that sense, it's not that one technique is only appropriate for input and another technique is only appropriate for output. I need to think about validating, sanitizing or escaping when doing anything with the data - e.g. write it to a DB, use it in a calculation, print it to screen, or inject it into a PDF document. And in all cases, I just need to think about how I make it safe for that particular use. Sanitizing "input" might be entirely appropriate - if it's quick and easy, makes the data safe for whatever I need to do and doesn't render the data inaccurate. Another example is the WordPress function esc_url_raw() which the manual says is specifically to be used when storing a URL in the database. So again, the idea that escaping is only appropriate for "output" is misleading.
I ended up validating the input before writing it to the database. I did not need to sanitize it aswell. So I if it's invalid, I tell the user. If it's valid, it gets written to the DB in its original form. And I escape it before displaying it back to the user.

Best practice seems to be to sanitize the data before sending it to the database.
This is a common misconception. Sanitization should only be performed on data that is being output, to prevent XSS for example and even then only as a last resort. Exactly because it can irreversibly destroy the original data.
Validation is your first line of defense. Make sure that the data is properly formatted, and valid within its context - just that; no looking for special characters, don't be over-zealous. If it's not valid - reject it, don't try to salvage the "good" parts from it.
Then, when storing in database, you merely need to use parameterized queries - that is 100% effective against SQL injections. If you didn't mangle the data in a previous step, you're storing it in its original form.
And finally, when the data is being output, that is where you SHOULD escape special characters within the appropriate context, so that it is properly rendered; or sanitize it if you have no other choice (i.e. the context is unclear and therefore you can't do proper escaping).

It looks like you are worried about user feelings, that's good. There are few things which you can do.
Use html form pattern - for sure no one name needs signs like < > & $ " ... - exclude this with pattern, use css :invalid and :invalid:focus to inform the user before submitting if something is wrong. It is very easy and simple.
Than goes php further validation and WP sanitation.
You can use intermediate state - after 'wash' - display final version (no inputs) with 2 buttons - save or correction - let the user decide, most of us don't like this repetitions "are you sure? clicking submit you mean submit?" - but maybe with so relevant content, users would like to have last chance, and they wish to see the final version (without inputs, checkboxes etc).
Now you put accepted version into db (prepared).
Comparing raw data with washed is not practical, honestly it sucks - the users won't be coders - they just won't be able to correctly understand "we sanitised your answers, and now they are 345 characters shorter. Sorry for inconvenience"
Don't worry too much
...there is a german last name 'Ei' - only 2 characters, so pattern can't require more than 2.

Should we be escaping strings in 2017 or does PHP do it for us?

The reason I ask this question is because I was checking stackoverflow for answer, and since 2012/13 it no longer seems to be a hot topic and all the answers documentation is deprecated. Could you please tell me if we still should be doing this and if so what's a secure way to do so? I'm specifically talking about user defined post data...
Update: the string will be html inputted from user and posted into my dB.

The short answer is yes. Even in 2017 you should be escaping strings in PHP. PHP does not do it by itself because not every developer will want to develop a product / functionality that needs to escape user input (for whatever that reason may be).
If you are echoing user inputted data to a webpage, you should use the function htmlspecialchars() to stop potential malicious coding from executing upon being read by your browser.
When you are retrieving data from a client, you can also use the FILTER_INPUT functions to validate incoming data to validate that the clients data is actually the data you want (e.g checking that no one has bypassed your client side validation and has entered Illegal characters into the data)
From my experience these are two great functions that can be used to 1:) escape output to a client and 2:) prevent the chance of malicious code being stored/processed on your server.

It depends entirely on what you are going to do with the string.
If you are going to treat it as code (whether that code is HTML, JavaScript, PHP, SQL or something else) then it will need escaping.
PHP is not able to tell if you trust the source of the data to write safe code.

In 2017 this is what is usually done in the scenario you describe:
The user inputs text in a form, the text is sent to the server, before that the text is url encoded (this is one form or escaping). This is typically done by the browser/javascript so no need to do it manually (but it does happen).
The server receives the text, decodes it and then creates a MySQL insert/update statement to store it in the database. While some people still run the mysqli_real_escape_string on it, the recommended way is to use prepared statements instead. Therefore in this aspect you do not need to do the escaping, however prepared statements delegate escaping to the database (so again escaping does happen)
If the user inputted text is to be presented back on a page then it is encoded via htmlentities or similar (which is itself another form of escaping). This is mostly ran manually although most new view template frameworks (e.g. twig or blade) take care of that for us.
So that's how it is today as far as I know. Escaping is very much required, but the programmer actually doing it is not so much a requirement if modern frameworks and practices are used.

Yes, escaping the strings from the request (and therefore imputable by the user) is a practical requirement because PHP makes available the data actually added to the payload of the request without any modification that could invalidate the data itself (not all the data needs Of escaping), so any subsequent processing on that data must be made and under the developer's control.
The escape of variables in database interaction operations to prevent SQL Injections.
In past versions of PHP there was the "magic_quoteas" feature that filtered every variable in GET or POST. But it is deprecated and is not a best practice. Why Not?
The state of the art in querying DB is predominantly in using the PDO driver with the prepared statement. At the time the variable is bound, the variable will be escaped automatically.
$conn->prepare('SELECT * FROM users WHERE name = :name');
$conn->bindParam(':name',$_GET['username']); //this do the escape too
$conn->execute();
Alternatively, mysql_real_escape_string manages it manually.
Alternatively, mysqli::real_escape_string manages it manually.

Sanitizing PHP Variables, am I overusing it?

I've been working with PHP for some time and I began asking myself if I'm developing good habits.
One of these is what I belive consists of overusing PHP sanitizing methods, for example, one user registers through a form, and I get the following post variables:
$_POST['name'], $_POST['email'] and $_POST['captcha']. Now, what I usually do is obviously sanitize the data I am going to place into MySQL, but when comparing the captcha, I also sanitize it.
Therefore I belive I misunderstood PHP sanitizing, I'm curious, are there any other cases when you need to sanitize data except when using it to place something in MySQL (note I know sanitizing is also needed to prevent XSS attacks). And moreover, is my habit to sanitize almost every variable coming from user-input, a bad one ?

Whenever you store your data someplace, and if that data will be read/available to (unsuspecting) users, then you have to sanitize it. So something that could possibly change the user experience (not necessarily only the database) should be taken care of. Generally, all user input is considered unsafe, but you'll see in the next paragraph that some things might still be ignored, although I don't recommend it whatsoever.
Stuff that happens on the client only is sanitized just for a better UX (user experience, think about JS validation of the form - from the security standpoint it's useless because it's easily avoidable, but it helps non-malicious users to have a better interaction with the website) but basically, it can't do any harm because that data (good or bad) is lost as soon as the session is closed. You can always destroy a webpage for yourself (on your machine), but the problem is when someone can do it for others.
To answer your question more directly - never worry about overdoing it. It's always better to be safe than sorry, and the cost is usually not more than a couple of milliseconds.

The term you need to search for is FIEO. Filter Input, Escape Output.
You can easily confound yourself if you do not understand this basic principle.
Imagine PHP is the man in the middle, it receives with the left hand and doles out with the right.
A user uses your form and fills in a date form, so it should only accept digits and maybe, dashes. e.g. nnnnn-nn-nn. if you get something which does not match that, then reject it.
That is an example of filtering.
Next PHP, does something with it, lets say storing it in a Mysql database.
What Mysql needs is to be protected from SQL injection, so you use PDO, or Mysqli's prepared statements to make sure that EVEN IF your filter failed you cannot permit an attack on your database. This is an example of Escaping, in this case escaping for SQL storage.
Later, PHP gets the data from your db and displays it onto a HTML page. So you need to Escape the data for the next medium, HTML (this is where you can permit XSS attacks).
In your head you have to divide each of the PHP 'protective' functions into one or other of these two families, Filtering or Escaping.
Freetext fields are of course more complex than filtering for a date, but never mind, stick to the principles and you will be OK.
Hoping this helps http://phpsec.org/projects/guide/

gameplan for escaping php MySQL content

Ive read every post here on escaping and unfortunately almost every one has disagreements amongst posters so I just want to ask the community about my specific situation before I make a major mistake because I misunderstood another post.
I am storing user preferences in a MySQL database where I personally place the information directly into the database myself, not user submitted inputs.
My questions are:
1.) If I am running a PHP query and placing the query result into other PHP code blocks, not as HTML but just as things like other queries, ie(SELECT * from $queryresult) there is no need to escape this correct?
2.) If I am outputting what I stored in the database as html directly from the database do I need to sanitize this output in anyway. My understanding is that sanitization is strictly for user submitted input. Need I really worry about data coming out of database fields I personally populated.
I think I know the answers here after reading but I dont want to leave any room for error on this one.

Question 1 - Escaping data for MySQL queries
No, you must always escape data in your queries, regardless of the source. Data escaping is for the query parser. Even if the data comes from your own code, you must escape it.
Learn to use PDO to avoid this problem.
Question 2 - Escaping data for HTML
If you are outputting data to HTML, you must always escape it with htmlspecialchars() or equivalent. This is so you don't have to worry about bad HTML code, as well as XSS.

I'm learning PHP on my own and I've become aware of the strip_tags() function. Is this the only way to increase security?

I'm new to PHP and I'm following a tutorial here:
Link
It's pretty scary that a user can write php code in an input and basically screw your site, right?
Well, now I'm a bit paranoid and I'd rather learn security best practices right off the bat than try to cram them in once I have some habits in me.
Since I'm brand new to PHP (literally picked it up two days ago), I can learn pretty much anything easily without getting confused.
What other way can I prevent shenanigans on my site? :D

There are several things to keep in mind when developing a PHP application, strip_tags() only helps with one of those. Actually strip_tags(), while effective, might even do more than needed: converting possibly dangerous characters with htmlspecialchars() should even be preferrable, depending on the situation.
Generally it all comes down to two simple rules: filter all input, escape all output. Now you need to understand what exactly constitutes input and output.
Output is easy, everything your application sends to the browser is output, so use htmlspecialchars() or any other escaping function every time you output data you didn't write yourself.
Input is any data not hardcoded in your PHP code: things coming from a form via POST, from a query string via GET, from cookies, all those must be filtered in the most appropriate way depending on your needs. Even data coming from a database should be considered potentially dangerous; especially on shared server you never know if the database was compromised elsewhere in a way that could affect your app too.
There are different ways to filter data: white lists to allow only selected values, validation based on expcted input format and so on. One thing I never suggest is try fixing the data you get from users: have them play by your rules, if you don't get what you expect, reject the request instead of trying to clean it up.
Special attention, if you deal with a database, must be paid to SQL injections: that kind of attack relies on you not properly constructing query strings you send to the database, so that the attacker can forge them trying to execute malicious instruction. You should always use an escaping function such as mysql_real_escape_string() or, better, use prepared statements with the mysqli extension or using PDO.
There's more to say on this topic, but these points should get you started.
HTH
EDIT: to clarify, by "filtering input" I mean decide what's good and what's bad, not modify input data in any way. As I said I'd never modify user data unless it's output to the browser.

strip_tags is not the best thing to use really, it doesn't protect in all cases.
HTML Purify:
http://htmlpurifier.org/
Is a real good option for processing incoming data, however it itself still will not cater for all use cases - but it's definitely a good starting point.

I have to say that the tutorial you mentioned is a little misleading about security:
It is important to note that you never want to directly work with the $_GET & $_POST values. Always send their value to a local variable, & work with it there. There are several security implications involved with the values when you directly access (or
output) $_GET & $_POST.
This is nonsense. Copying a value to a local variable is no more safe than using the $_GET or $_POST variables directly.
In fact, there's nothing inherently unsafe about any data. What matters is what you do with it. There are perfectly legitimate reasons why you might have a $_POST variable that contains ; rm -rf /. This is fine for outputting on an HTML page or storing in a database, for example.
The only time it's unsafe is when you're using a command like system or exec. And that's the time you need to worry about what variables you're using. In this case, you'd probably want to use something like a whitelist, or at least run your values through escapeshellarg.
Similarly with sending queries to databases, sending HTML to browsers, and so on. Escape the data right before you send it somewhere else, using the appropriate escaping method for the destination.

strip_tags removes every piece of html. more sophisticated solutions are based on whitelisting (i.e. allowing specific html tags). a good whitelisting library is htmlpurifyer http://htmlpurifier.org/
and of course on the database side of things use functions like mysql_real_escape_string or pg_escape_string

Well, probably I'm wrong, but... In all literature, I've read, people say It's much better to use htmlspellchars.
Also, rather necessary to cast input data. (for int for example, if you are sure it's user id).
Well, beforehand, when you'll start using database - use mysql_real_escape_string instead of mysql_escape_string to prevent SQL injections (in some old books it's written mysql_escape_string still).

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.