I'm building a category list and I'm unsure how to store the ampersand symbol
in a MySQL database. Is there any problem or disadvantage if I use '&'? Are there any differences
from using it in its HTML form, '&amp;'?
Using '&' saves a few bytes. It is not a special character for MySQL. You can easily produce the &amp; for output later with PHP's function htmlspecialchars(). When your field is meant to keep simple information, you should use plain text only, as this is in general more flexible: you can generate different kinds of markup from it later. Exception: the markup is produced by a user whose layout decisions you want to save with the text (as in rich-text input). If you have tags etc. in your input, you may want to use &amp; for consistency.
You should store it as &amp; only if the field in the DB contains HTML (like <span class="bold">some text</span> &amp; more). In that case you should be very careful about XSS.
If the field contains some general data (like a username, title, etc.) you should only escape it when you put it in your HTML (using htmlentities, for example).
Storing it as & is an appropriate method. You can echo it or use it in statements as &.
We store '&' in database fields all the time; it's fine to do so (at least I've never heard an argument otherwise).
If you're only ever using the string in an HTML page you could just store the HTML-safe &amp; version, I suppose. I would suggest that storing '&' and escaping it when you read it would be better though (in case you need to use the string in a non-HTML context in the future).
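A minimal sketch of the "store plain, escape on output" approach the answers above recommend. The category name is a made-up example; the only real API used is htmlspecialchars().

```php
<?php
// Hypothetical category name as it would be stored in the database:
// plain text with a raw ampersand.
$category = 'Food & Drink';

// Escape only at the point where the value is written into HTML.
$html = '<li>' . htmlspecialchars($category, ENT_QUOTES, 'UTF-8') . '</li>';

echo $html, "\n"; // <li>Food &amp; Drink</li>
```

The stored value stays usable for non-HTML output (CSV export, e-mail subject, JSON) without having to decode it first.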
Use &amp; if you want to have valid HTML or avoid problems, like cut&copy (browser shows it as cut©).
Related
I have a user input field which will be stored into a 'tinytext' field in a MySQL database; pretty standard stuff. I am wondering if there is some sort of standard or best-practice to adhere to when it comes to escaping html special characters using the php function htmlentities()?
Should I use htmlentities() before I store the data in the database or should I run the function on the data every time it is output from the website?
There is usually no reason to use htmlentities() at all any more. Just store everything in UTF-8 fields and adhere to UTF-8 all the way through.
When outputting unsafe user input as HTML, use htmlspecialchars(), ideally at the time of output so you have a copy of the original data.
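To illustrate the difference between the two functions, here is a small sketch, assuming the data is UTF-8 throughout. The input string is invented for the example.

```php
<?php
// "caf\u{e9}" is "café"; the \u{...} escape needs PHP 7 or later.
$input = "caf\u{e9} <b>";

$special  = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // é is left alone, < and > are escaped
echo $entities, "\n"; // é is turned into &eacute; as well
```

With a UTF-8 page the extra entities add nothing, which is why htmlspecialchars() is usually enough.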
What's the best route for storing data in MySQL? Should I just use TEXT as my field type?
Also, when I use mysql_real_escape_string(), the returned values contain \r\n.
But should I be running htmlentities() on it after that?
And then when I return data to the screen, should I use nl2br()?
Just trying to figure out the best route here for storing this information.
Thank you for your help!
TEXT or TINYTEXT or anything similar should be fine for storing ASCII data from the user. If you don't need a lot of space you may think about VARCHAR.
I think that mysql_real_escape_string() escapes characters that may compromise the security of an SQL query (single quote, double quote, etc.) but doesn't do much more than that.
htmlentities() converts reserved HTML characters like < and > into their HTML-encoded equivalents, &lt; and &gt; respectively. These characters are not dangerous for SQL queries, so you probably do not need to escape them unless you want to display the HTML tag entered by the user as text, and not let it be interpreted as HTML.
NL2BR() is probably not necessary either.
Most importantly, your decision on when to use each of these functions will depend on your end application. You may need / want some but not others ( though you should definitely use mysql_real_escape_string() )
It really depends on what you are trying to store. For things such as usernames, passwords, etc., you can use VARCHAR. But if you're storing long text such as news posts or HTML data, then you can use TEXT or LONGTEXT (depending on how long it is).
You should ALWAYS use mysql_real_escape_string() when inserting into the DB. If you're outputting HTML from the DB, you may want to run htmlentities or htmlspecialchars to ensure that you aren't outputting user-injected JavaScript that could redirect your users to hacker websites and such.
One other idea is that you could escape your data using htmlentities before inserting into the DB, but it's your choice.
nl2br() is great for turning all \r\n into <br /> tags instead.
So, it seems like you're on the right track...
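If you do combine escaping with nl2br() on output, the order matters. A small sketch, assuming the stored value is the raw text the user typed (the sample string is made up):

```php
<?php
// Hypothetical value as read back from the database: raw text, real line breaks.
$stored = "1 < 2\r\nsecond line";

// Escape first, then add <br /> tags; the other order would escape the tags too.
$html = nl2br(htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'));

echo $html, "\n";
```

nl2br() inserts the tag in front of each line break and leaves the break itself in place.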
I am trying to figure out what is the best way to manage the data a user inputs concerning non desirable tags he might insert:
strip_tags() - the tags are removed and they are not inserted in the database
the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
Which is better, and is there any disadvantage to either of these?
Regards
This depends on what your priority is:
if it's important to display special characters from user input (like on StackOverflow, for example), then you'll need to store this information in the database and sanitize it on display - in this case, you'll want to at least use htmlspecialchars() to display the output (if not something more sophisticated)
if you just want plain text comments, use strip_tags() before you stick it in the database - this way you'll reduce the amount of data that you need to store, and reduce processing time when displaying the data on the screen
the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
This. You usually want people to be able to type less-than signs and ampersands and have them displayed as such on the page. htmlspecialchars on every text-to-HTML output step (whether that text came directly from user input, or from the database, or from somewhere else entirely) is the right way to achieve this. Messing about with the input is a not-at-all-appropriate tactic for dealing with an output-encoding issue.
Of course, you will need a different escape — or parameterisation — for putting text in an SQL string.
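To make the two options from the question concrete, here is a short sketch with an invented input string:

```php
<?php
$input = '<b>Hello</b> world';

$stripped = strip_tags($input);                            // tags removed, what the user typed is lost
$escaped  = htmlspecialchars($input, ENT_QUOTES, 'UTF-8'); // tags kept and shown as text

echo $stripped, "\n"; // Hello world
echo $escaped, "\n";  // &lt;b&gt;Hello&lt;/b&gt; world
```

Stripping changes the data itself, while escaping only changes how one particular output renders it.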
The measures taken to secure user input depend entirely on the context in which the data is used. For instance:
If you're inserting it into a SQL database, you should use parameterized statements. PHP's mysql_real_escape_string() works decently, as well.
If you're going to display it on an HTML page, then you need to strip or escape HTML tags.
In general, any time you're mixing user input with another form of mark-up or another language, that language's elements need to be escaped or stripped from the input before it is put into that context.
The last point above segues into the next point: Many feel that the original input should always be maintained. This makes a lot of sense when, later, you decide to use the data in a different way and, for instance, HTML tags aren't a big deal in the new context. Also, if your site is in some way compromised, you have a record of the exact input given.
Specifically related to HTML tags in user input intended for display on an HTML page: If there is any conceivable reason for a user to input HTML tags, then simply escape them. If not, strip them before display.
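A rough sketch of the parameterized-statement route mentioned above, using PDO. To keep it runnable without a server it uses an in-memory SQLite database as a stand-in for MySQL (this needs the pdo_sqlite extension); the table and column names are hypothetical.

```php
<?php
// In-memory SQLite stands in for MySQL here; with MySQL only the DSN would differ.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (body TEXT)');

$input = "O'Reilly <b>& sons</b>";

// The placeholder keeps the data out of the SQL text, so no manual escaping is needed.
$stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (?)');
$stmt->execute([$input]);

// The original input comes back unchanged; escape it only when writing HTML.
$stored = $pdo->query('SELECT body FROM comments')->fetchColumn();
echo htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'), "\n";
```

The database holds exactly what the user typed, which is the "keep the original input" point made above.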
When outputting user input, do you only use htmlspecialchars() or are there other functions/actions/methods you also run? I'm looking for something that will also deal with XSS.
I'm wondering if I should write a function that escapes user input on output or just use htmlspecialchars(). I'm looking for the generic cases, not the specific cases that can be dealt with individually.
I usually use
htmlspecialchars($var, ENT_QUOTES)
on input fields. I created a method that does this because I use it a lot, and it makes the code shorter and more readable.
Let's have a quick review of WHY escaping is needed in different contexts:
If you are in a quote delimited string, you need to be able to escape the quotes.
If you are in xml, then you need to separate "content" from "markup"
If you are in SQL, you need to separate "commands" from "data"
If you are on the command line, you need to separate "commands" from "data"
This is a really basic aspect of computing in general. Because the syntax that delimits data can occur IN THE DATA, there needs to be a way to differentiate the DATA from the SYNTAX, hence, escaping.
In web programming, the common escaping cases are:
1. Outputting text into HTML
2. Outputting data into HTML attributes
3. Outputting HTML into HTML
4. Inserting data into Javascript
5. Inserting data into SQL
6. Inserting data into a shell command
Each one has different security implications if handled incorrectly. THIS IS REALLY IMPORTANT! Let's review this in the context of PHP:
Text into HTML:
htmlspecialchars(...)
Data into HTML attributes
htmlspecialchars(..., ENT_QUOTES)
HTML into HTML
Use a library such as HTMLPurifier to ENSURE that only valid tags are present.
Data into Javascript
I prefer json_encode. If you are placing it in an attribute, you still need to apply #2 (htmlspecialchars with ENT_QUOTES) on top.
Inserting data into SQL
Each driver has an escape() function of some sort. It is best. If you are running in a normal latin1 character set, addslashes(...) is suitable. Don't forget the quotes AROUND the addslashes() call:
"INSERT INTO table1 SET field1 = '" . addslashes($data) . "'"
Data on the command line
escapeshellarg() and escapeshellcmd() -- read the manual
--
Take these to heart, and you will eliminate 95%* of common web security risks! (* a guess)
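As a rough illustration of how the same value needs different treatment per context, here is a sketch for cases 1/2, 4 and 6 from the list above. The input string and the grep command line are invented for the example; the JSON_HEX_* flags are an extra precaution for embedding in a script block, not something the list above prescribes.

```php
<?php
$value = '</script><b>"hi" & bye';

// 1/2. Text or attribute value in HTML.
$html = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

// 4. Data inside a <script> block: JSON plus the JSON_HEX_* flags, so that
//    <, >, &, ' and " inside the string are emitted as \uXXXX escapes.
$js = json_encode($value, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);

// 6. One argument of a shell command.
$arg = escapeshellarg($value);

echo $html, "\n";
echo 'var v = ' . $js . ';', "\n";
echo 'grep ' . $arg . ' file.txt', "\n";
```

None of the three results is interchangeable with the others, which is the argument for escaping at the point of use.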
You shouldn't be cleansing text on output; it should happen on input. I use a filter that filters all input to the application. It is configurable so that it can allow specific tags/data through when needed (say, for a WYSIWYG editor).
You should do as little processing of text on output as possible so that you ensure speed. Processing it once creates a lot less strain than processing it 500,0000 times.
I want to prevent XSS attacks in my web application. I found that HTML-encoding the output can really prevent XSS attacks. Now the problem is: how do I HTML-encode every single output in my application? Is there a way to automate this?
I'd appreciate answers for JSP, ASP.NET and PHP.
One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.
You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars
For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.
A nice way I used to escape all user input is by writing a modifier for Smarty which escapes all variables passed to the template, except for the ones that have |unescape attached to them. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)
My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason to encode everything is that even properties which you might assume to be boolean or numeric could contain malicious code. (For example, checkbox values, if they're handled improperly, could be coming back as strings. If you're not encoding them before sending the output to the user, then you've got a vulnerability.)
You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
You could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)
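A minimal sketch of such a wrapper, assuming the myecho() name from above and a hypothetical second parameter for opting out of escaping:

```php
<?php
// Hypothetical helper: escapes unless the caller explicitly opts out.
function myecho(string $text, bool $escape = true): void
{
    echo $escape ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : $text;
}

myecho('<script>alert(1)</script>');      // printed as harmless text
echo "\n";
myecho('<em>trusted markup</em>', false); // printed as-is
echo "\n";
```

Making escaping the default means a forgotten argument fails safe: the worst case is visible markup rather than injected script.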
If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML-encode every single input, you'll have problems accepting external passwords containing < etc.
The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.
OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).
Output encoding is by far the best defense. Validating input is great for many reasons, but it is not a 100% defense. If a database becomes infected with XSS via attack (e.g. ASPROX), mistake, or malice, input validation does nothing. Output encoding will still work.
There was a good essay on Joel on Software ("Making Wrong Code Look Wrong", I think; I'm on my phone, otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically, prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everything. But by using these prefixes, which carry a useful meaning, you'll see real quick when looking at your code if something isn't right. And you're going to need different encode/decode functions anyway.