Currently I am upgrading a web application in which I will get most of the input from logged in users. The input will contains valid html, images, audio, video & upload facilities to user defined path. The application then formats it into nice ui and displays to end users. These privileged users can add / modify / delete the content using a web based interface.
As per the basic rule of thumb: I should escape my data before entering in DB, and not to receive data receive from user. To achieve that I have planned to follow following security measures. Which also includes my questions
I am using prepared statements to store all user inputs to DB. I hope this eliminates the DB injection threat.
Is this measure enough? or do i need to check for % and _ symbols as well for mysql LIKE queries?
The user input (lets call input A), where I am not expecting any HTML/css, I use strip_tags & htmlentities before inserting in DB.
Is this adequate measure ? Should I be using more
The user input (lets call input B), in which user can have html/css tags, I user htmlentities on text then insert in DB.
As far as I am aware I should not use htmlentities before inserting in the DB, but have to as previous programmer was using it. Are there any negative impacts for this?
After fetching from DB and Before displaying the input A / input B , I am not doing any pre processing assuming, the data added to DB should be clean.
Should i process / sanitize the data before displaying ? If yes then how ?
I want to html tags enters by user to be parsed by browser and not displayed to user. e.g. if user had entered <p style='color:red;'>hello</p><p class='noclass'>world</p>, I want user to see 2 words only and not actual text.
To achieve this how can I make sure that user doesn't add malicious script and at the same time the html tags are stored, fetched and parsed by browser correctly.
Please guide if the current approach is sufficient / not sufficient / less / incorrect.
I am neither a 100% newbie to php nor I m pro. I know the basics about php (or we can say over all web applications') security. So can someone can please guide me if I am making any mistake security wise OR should not be doing something OR should be doing something more or less.
I know the basics of security but I still get confused over
Which exact security measure to apply at which exact point ? (e.g. escape string BEFORE inserting to DB)
At every point what the functions available in php? (e.g. to escape strings use prepared statements)
Yes, prepared statements are great at preventing SQL injections problems. Yes, you will have to take care of % and _ in LIKE queries, a prepared statement cannot escape them since it has no way to know whether you want those values there or not.
through 5.: It's always a bad idea to escape data going into the database for a format it's destined for on output. Why? First of all, why are you so sure you're always going to use the data in an HTML context? Maybe you'll be using it in a different format in the future, and then you'll have garbage looking data. (This is more hypothetical in your case, as you're explicitly storing HTML.)
Secondly though, your output code will have to rely on your input code to correctly have escaped data in advance, possibly with a long time between input and output. Your output code can have no confidence whatsoever that the input code did the correct job for what the output code needs it to do. Therefore, escaping for output must happen at the time of output. No sooner, no later.
Thirdly (is that a word?), strip_tags is absolutely insufficient to accept some HTML but not other "insecure" HTML. You need a more complex library which has more complex whitelisting rules than what strip_tags can do. Supposedly the only library that does that is HTML Purifier. I'd run all user HTML through it.
To summarise:
Prepared statements.
HTML-escape data that is not supposed to contain literal HTML on output.
Run any data that is supposed to contain literal HTML through HTML Purifier. Whether you do this before or after inserting to the database is up to you, depending on whether you want to store the literal input the user sent you or whether you don't mind discarding that original data immediately and storing only sanitised data instead. But, the same caveat about having confidence in your output code applies too.
Related
I have a simple app programmed in PHP using CodeIgniter 4 framework and, as a web application, it has some HTML forms for user input.
I am doing two things:
In my Views, all variables from the database that come from user input are sanitized using CodeIgniter 4's esc() function.
In my Controllers, when reading HTTP POST data, I am using PHP filters:
$data = trim($this->request->getPost('field', FILTER_SANITIZE_SPECIAL_CHARS));
I am not sure if sanitizing both when reading data from POST and when printing/displaying to HTML is a good practice or if it should only be sanitized once.
In addition, FILTER_SANITIZE_SPECIAL_CHARS is not working as I need. I want my HTML form text input to prevent users from attacking with HTML but I want to keep some 'line breaks' my database has from the previous application.
FILTER_SANITIZE_SPECIAL_CHARS will NOT delete HTML tags, it will just store them in the database, not as HTML, but it is also changing my 'line breaks'. Is there a filter that doesn't remove HTML tags (only stores them with proper condification) but that respects \n 'line breaks'?
You don't need to sanitize User input data as explained in the question below:
How can I sanitize user input with PHP?
It's a common misconception that user input can be filtered. PHP even
has a (now deprecated) "feature", called
magic-quotes,
that builds on this idea. It's nonsense. Forget about filtering (or
cleaning, or whatever people call it).
In addition, you don't need to use FILTER_SANITIZE_SPECIAL_CHARS, htmlspecialchars(...), htmlentities(...), or esc(...) either for most use cases:
-Comment from OP (user1314836)
I definitely think that I don't need to sanitize user-input data
because I am not writing SQL directly but rather using CodeIgniter 4's
functions to create SQL safe queries. On the other hand, I do
definitely need to esc() that same information when showing to avoid
showing html where just text is expected.
The reason why you don't need the esc() method for most use cases is:
Most User form input in an application doesn't expect a User to submit/post HTML, CSS, or JavaScript that you plan on displaying/running later on.
If the expected User input is just plain text (username, age, birth date, etc), images, or files, use form validation instead to disallow unexpected data.
I.e: Available Rules and Creating Custom Rules
By using the Query Builder for your database queries and rejecting unexpected User input data using validation rules (alpha, alpha_numeric_punct, numeric, exact_length, min_length[8], valid_date, regex_match[/regex/], uploaded, etc), you can avoid most potential security holes i.e: SQL injections and XSS attacks.
Answer from steven7mwesigwa gets my vote, but here is how you should be thinking about it.
Rules Summary
You should always hold in memory the actual data that you want to process.
You should always convert the data on output into a format that the output can process.
Inputs:
You should strip from all untrusted inputs (user forms, databases that you didn't write to, XML feeds that you don't control etc)
any data that you are unable to process (e.g. if you are not able to handle multi-byte strings as you are not using the right functions, or your DB won't support it, or you can't handle UTF8/16 etc, strip those extra characters you can't handle).
any data that will never form part of the process or output (e.g. if you can only have an integer/bool than convert to int/bool; if you are only showing data on an HTML page, then you may as well trim spaces; if you want a date, strip anything that can't be formatted as a date [or reject*]).
This means that many "traditional" cleaning functions are not needed (e.g. Magic Quotes, strip_tags and so on): but you need to know you can handle the code. You should only strip_tags or escape or so on if you know it is pointless having that data in that field.
Note: For user input I prefer to hold the data as the user entered and reject the form allowing them to try again. e.g. If I'm expected a number and I get "hello" then I'll reload the form with "hello" and tell the user to try again. steven7mwesigwa has links to the validation functions in CI that make that happen.
Outputs:
Choose the correct conversion for the output: and don't get them muddled up.
htmlspecialchars (or family) for outputting to HTML or XML; although this is usually handled by any templating engine you use.
Escaping for DB input; although this should be left to the DB engine you use (e.g. parameterised queries, query builder etc).
urlencode for outputting a URL
as required for saving images, json, API responses etc
Why?
If you do out output conversion on input, then you can easily double-convert an input, or lose track of if you need to make it safe before output, or lose data the user wanted to enter. Mistakes happen but following clean rules will prevent it.
This also mean there is no need to reject special characters (those forms that reject quote marks are horrible user experience, for example, and anyone putting restrictions on what characters can go in a password field are only weakening security)
In your particular case:
Drop the FILTER_SANITIZE_SPECIAL_CHARS on input, hold the data as the user gave it to you
Output using template engine as you have it: this will display < > tags as the user entered then, but won't break your output.
You will essentially sanitize each and every output (that you appear to want to avoid), but that's safer than accidentally missing a sanitize on output and a better user experience than losing stuff they typed.
From my understanding,
FILTER_SANITIZE_SPECIAL_CHARS is used to sanitize the user input before you act on it or store it.
Whereas esc is used to escape HTML etc in the string so they don't interfere with normal html, css etc. It is used for viewing the data.
So, you need both, one for input and the other for output.
Following from codeigniter.com. Note, it uses the Laminas Escaper library.
esc($data[, $context = 'html'[, $encoding]])
Parameters
$data (string|array) – The information to be escaped.
$context (string) – The escaping context. Default is ‘html’.
$encoding (string) – The character encoding of the string.
Returns
The escaped data.
Return type
mixed
Escapes data for inclusion in web pages, to help prevent XSS attacks. This uses the Laminas Escaper library to handle the actual filtering of the data.
If $data is a string, then it simply escapes and returns it. If $data is an array, then it loops over it, escaping each ‘value’ of the key/value pairs.
Valid context values: html, js, css, url, attr, raw
From docs.laminas.dev
What laminas-Escaper is not
laminas-escaper is meant to be used only for escaping data for output, and as such should not be misused for filtering input data. For such tasks, use laminas-filter, HTMLPurifier or PHP's Filter functionality should be used.
Some of the functions they do are similar. Such as both may/will convert < to <. However, your stored data may not have come just from user input and it may have < in it. It is perfectly safe to store it this way
but it needs to be escaped for output otherwise the browser could get confused, thinking its html.
I think for this situation using esc is sufficient. FILTER_SANITIZE_SPECIAL_CHARS is a PHP sanitize filter that encode '"<>& and optionally strip or encode other special characters according to the flag. To do that you need to set the flag. It is third parameter in getPost() method. Here is an example
$this->request->getPost('field', FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_ENCODE_HIGH)
This flag can be change according to your requirements. You can use any PHP filter with a flag. Please refer php documentation for more info.
I was wondering if converting POST input from an HTML form into html entities, (via the PHP function htmlentities() or using the FILTER_SANITIZE_SPECIAL_CHARS constant in tandem with the filter_input() PHP function ), will help defend against any attacks where a user attempts to insert any JavaScript code inside the form field or if there's any other PHP based function or tactic I should employ to create a safe HTML form experience?
Sorry for the loaded run-on sentence question but that's the best I could word it in a hurry.
Any responses would be greatly appreciated and thanks to all in advance.
racl101
It would turn the following:
<script>alert("Muhahaha");</script>
into
<script>alert("Muhahaha");</script>
So if you're printing out this data into HTML later, you would be protected. It wouldn't protect you from:
"; alert("Muhahaha");
just in case you were echoing into a script like so:
var t = "Hello there <?php echo $str;?>";
For this purpose, you should use addslashes() and a database string escaping method like mysql_real_escape_string().
yes, that is one way to sanitise. it has the benefit that you can always display the database contents without fear of xss attacks. however, a 'purer' approach is to store the raw data in the database and sanitise in the view - so every time you want to show the text, use htmlentities() on it.
however, your approach does not take into account sql injection attacks. you might want to look at http://php.net/manual/en/function.mysql-real-escape-string.php to guard against that.
Yes, do this when you want to display data to a webpage, but I recommend you don't store the HTML in the database as encoded, this may seem fine for large text fields, but when you have shorter titles, say a 32 character, a normal 30 character string that contains an & would become & and this would either cause a SQL error or the data to be cut off.
So the rule of thumb is, store everything row (obviously prevent SQL injection) and treat EVERYTHING as tainted, no matter where it comes from: the database, user forms, rss feeds, flat files, XML, etc. This is how you build good security without worrying about the data overflowing, or the fact you might oneday need to extract the data to a non web user where the HTML encoding is a problem.
Quick question, is it a better idea to call htmlentities() (or htmlspecialchars()) before or after inserting data into the database?
Before: The new longer string will cause me to have to change the database to hold longer values in the field. (maxlength="800" could change to a 804 char string)
After: This will require a lot more server processing, and hundreds of calls to htmlspecialchars() could be made on every page load or AJAX load.
SOOO. Will converting when results are retrieved slow my code significantly? Should I change the DB?
I'd recommend storing the most raw form of the data in the database. That gives you the most flexibility when choosing how and where to output that data.
If you find that performance is a problem, you could cache the HTML-formatted version of this data somehow. Remember that premature optimization is a bad thing.
I have no experience of php but generally I always convert or escape nearest to output. You don't know when your output requirements will change, for example you may want to spit out data as XML, or JSON arrays and so escaping for HTML and then storing means you're limited to using the data as HTML alone.
In a php/MySQL web app, data flows in two ways
Database -> scripting language (php) -> HTML output -> browser ->screen
and
Keyboard-> browser-> $_POST -> php -> SQL statement -> database.
Data is defined as everything provided by the user.
ALWAYS ALWAYS ALWAYS....
A) process data through mysql_real_escape_string as you move it into an SQL statement, and
B) process data through htmlspecialchars as you move it into the HTML output.
This will protect you from sql injection attacks, and enable html characters and entities to display properly (unless you manage to forget one place, and then you have opened up a security hole).
Did I mention that this has to be done for every single piece of data any user could ever have touched, altered or provided via a script?
p.s. For performance reasons, use UTF-8 encoding everywhere.
It's best to store text as raw and encode it as needed, to be honest, you always need to htmlencode your data anyways when you're outputting it to the wbe page to prevent XSS hacking.
You shouldn't encode your data before you put it in the database. The main reason are:
If such data is near the column size limit, say 32 chars, if the title was "Steve & Fred blah blah" then you might go over that column limit because a 1 char & becomes a 5 char & amp;
You are assuming the data will always be displayed in a web page, in the future you never know where you'll be looking at the data and you might not want it encoded, now you have to decode it and it's possible you might not have access to PHP's decode function
It is the way of the craftsman to "measure twice, optimize once".
If you don't need high performance for your website, store it as raw data and when you output it do what you want.
If you need performance then consider storing it twice: raw data to do what you want with it and another field with the filtered data. It could be seen as redundant, but CPU is expensive, while data storage is really cheap.
The easiest way is store the data "as is" and then convert to htmlentities wherever it is needed.
The safest solution is to filter the data before it goes in into the Database as this prevents possible attacks on your server and database from the lack of security implementation, and then convert it however you need when needed. Also if you are using PDO this will happen automatically for you using prepared statements.
http://php.net/PDO
We had this debate at work recently. We decided to store the escaped values in the database, because before (when we were storing it unescaped) there were corner cases where data was being displayed without being escaped. This can lead to XSS. So we decided to store it escaped to be safe, and if you want it unescaped you have to do the work yourself.
Edit: So to everyone who disagrees, let me add some backstory for my case. Let's say you're working in a team of 50+ people... and data from the database is not guaranteed to be HTML-Encoded on the way out - there's no built-in mechanism for it so the developer has to write the code to do it. And this data is shown all over the place so it's not going through 1 developer's code it's going through 30's - most of whom have no clue about this data (or that it could even contain angle brackets which is rare) and merely want to get it shown on the page, move on, and forget about it.
Do you still think it's better to put the data, in HTML, into the database and rely on random people who are not-you to do things properly? Because frankly, while it certainly may not seem warm-fuzzy-best-practicey, I prefer to fail closed (meaning when the data comes through in a Word Doc it looks like Value<Stock rather than Value<Stock) rather than open (so the Word Doc looks right with no work, but some corner of the platform may/likely-is vulnerable to XSS). You can't have both.
I am trying to figure out what is the best way to manage the data a user inputs concerning non desirable tags he might insert:
strip_tags() - the tags are removed and they are not inserted in the database
the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
What's the better, and is there any disadvantage in any of these?
Regards
This depends on what your priority is:
if it's important to display special characters from user input (like on StackOverflow, for example), then you'll need to store this information in the database and sanitize it on display - in this case, you'll want to at least use htmlspecialchars() to display the output (if not something more sophisticated)
if you just want plain text comments, use strip_tags() before you stick it in the database - this way you'll reduce the amount of data that you need to store, and reduce processing time when displaying the data on the screen
the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
This. You usually want people to be able to type less-than signs and ampersands and have them displayed as such on the page. htmlspecialchars on every text-to-HTML output step (whether that text came directly from user input, or from the database, or from somewhere else entirely) is the right way to achieve this. Messing about with the input is a not-at-all-appropriate tactic for dealing with an output-encoding issue.
Of course, you will need a different escape — or parameterisation — for putting text in an SQL string.
The measures taken to secure user input depends entirely on in what context the data is being used. For instance:
If you're inserting it into a SQL database, you should use parameterized statements. PHP's mysql_real_escape_string() works decently, as well.
If you're going to display it on an HTML page, then you need to strip or escape HTML tags.
In general, any time you're mixing user input with another form of mark-up or another language, that language's elements need to be escaped or stripped from the input before put into that context.
The last point above segues into the next point: Many feel that the original input should always be maintained. This makes a lot of sense when, later, you decide to use the data in a different way and, for instance, HTML tags aren't a big deal in the new context. Also, if your site is in some way compromised, you have a record of the exact input given.
Specifically related to HTML tags in user input intended for display on an HTML page: If there is any conceivable reason for a user to input HTML tags, then simply escape them. If not, strip them before display.
As I prepare to tackle the issue of input data filtering and sanitization, I'm curious whether there's a best (or most used) practice? Is it better to filter/sanitize the data (of HTML, JavaScript, etc.) before inserting the data into the database, or should it be done when the data is being prepared for display in HTML?
A few notes:
I'm doing this in PHP, but I suspect the answer to this is language agnostic. But if you have any recommendations specific to PHP, please share!
This is not an issue of escaping the data for database insertion. I already have PDO handling that quite well.
Thanks!
When it comes to displaying user submitted data, the generally accepted mantra is to "Filter input, escape output."
I would recommend against escaping things like html entities, etc, before going into the database, because you never know when HTML will not be your display medium. Also, different types of situations require different types of output escaping. For example, embedding a string in Javascript requires different escaping than in HTML. Doing this before may lull yourself into a false sense of security.
So, the basic rule of thumb is, sanitize before use and specifically for that use; not pre-emptively.
(Please note, I am not talking about escaping output for SQL, just for display. Please still do escape data bound for an SQL string).
i like to have/store the data in original form.
i only escape/filter the data depending on the location where i'm using it.
on a webpage - encode all html
on sql - kill quotes
on url - urlencoding
on printers - encode escape commands
on what ever - encode it for that job
There are at least two types of filtering/sanitization you should care about :
SQL
HTML
Obviously, the first one has to be taken care of before/when inserting the data to the database, to prevent SQL Injections.
But you already know that, as you said, so I won't talk about it more.
The second one, on the other hand, is a more interesting question :
if your users must be able to edit their data, it is interesting to return it to them the same way they entered it at first ; which means you have to store a "non-html-specialchars-escaped" version.
if you want to have some HTML displayed, you'll maybe use something like HTMLPurifier : very powerful... But might require a bit too much resources if you are running it on every data when it has to be displayed...
So :
If you want to display some HTML, using a heavy tool to validate/filter it, I'd say you need to store an already filtered/whatever version into the database, to not destroy the server, re-creating it each time the data is displayed
but you also need to store the "original" version (see what I said before)
In that case, I'd probably store both versions into database, even if it takes more place... Or at least use some good caching mecanism, to not-recreate the clean version over and over again.
If you don't want to display any HTML, you will use htmlspecialchars or an equivalent, which is probably not that much of a CPU-eater... So it probably doesn't matter much
you still need to store the "original" version
but escaping when you are outputing the data might be OK.
BTW, the first solution is also nice if users are using something like bbcode/markdown/wiki when inputting the data, and you are rendering it in HTML...
At least, as long as it's displayed more often than it's updated -- and especially if you don't use any cache to store the clean HTML version.
Sanitize it for the database before putting it in the database, if necessary (i.e. if you're not using a database interactivity layer that handles that for you). Sanitize it for display before display.
Storing things in a presently unnecessary quoted form just causes too many problems.
I always say escape things immediately before passing them to the place they need to be escaped. Your database doesn't care about HTML, so escaping HTML before storing in the database is unnecessary. If you ever want to output as something other than HTML, or change which tags are allowed/disallowed, you might have a bit of work ahead of you. Also, it's easier to remember to do the escaping right when it needs to be done, than at some much earlier stage in the process.
It's also worth noting that HTML-escaped strings can be much longer than the original input. If I put a Japanese username in a registration form, the original string might only be 4 Unicode characters, but HTML escaping may convert it to a long string of "〹𐤲䡈穩". Then my 4-character username is too long for your database field, and gets stored as two Japanese characters plus half an escape code, which also probably prevents me from logging in.
Beware that browsers tend to escape some things like non-English text in submitted forms themselves, and there will always be that smartass who uses a Japanese username everywhere. So you may want to actually unescape HTML before storing.
Mostly it depends on what you are planning to do with the input, as well as your development environment.
In most cases you want original input. This way you get the power to tweak your output to your heart's content without fear of losing the original. This also allows you to troubleshoot issues such as broken output. You can always see how your filters are buggy or customer's input is erroneous.
On the other hand some short semantic data could be filtered immediately. 1) You don't want messy phone numbers in database, so for such things it could be good to sanitize. 2) You don't want some other programmer to accidentally output data without escaping, and you work in multiprogrammer environment. However, for most cases raw data is better IMO.