When I use Zend DB (PDO Mysql Adapter), I'm getting back results that are not only escaped, but also HTML url-encoded.
I'm inserting the rows into the database as is, not escaped or html encoded. I'm curious to know:
How I can get back results that aren't escaped and html encoded,
If I should be doing something to treat the data before inserts,
And if it isn't possible to get back results that aren't escaped and html url encoded, how to do it myself.
I'd like to know if retrieving the results as escaped and html encoded is actually the proper way to do things?
Actually I think you just wonder why the strings are encoded.
AFAIK ZF does not does how you describe is does. So the "error" must be somewhere in your data-processing.
Additional Note: To improve your questions in the future, I would ask more straight forward and provide some code what you actually do otherwise your question is much of an invitation to guess around. With a concrete example it often works best.
Related
I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as –.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
OR, is there something I should be doing (like more escaping) when reading reading data back out from the the mySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!
Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing to to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: You input data seems to be encoded with UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded with 0xE28093 in UTF-8 which would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
Sanitization is a tricky subject, because it never means the same thing depending on the context. :)
real_escape_string just makes sure your data can be included in a request (inside quotes, of course) without having the possibility to change the "meaning" of the request.
The manual page explains what the function really does: it escapes nul characters, line feeds, carriage returns, simple quotes, double quotes, and "Control-Z" (probably the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a request. But it doesn't sanitize it under any other point of view: users can still pass for instance HTML markers, or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
I'm storing data in a MySQL database that may have some special characters. I'm wondering how to store it so that these characters are preserved if they're either output to HTML via PHP OR via JavaScript, e.g. createTextNode.
For example, the division symbol (÷) has the html code ÷, and when I store it as that it shows up fine when put directly into HTML by PHP, but when I pull it into JavaScript using $.getJSON and then insert it with createTextNode it shows up looking like ÷.
I also tried storing the symbol in the SQL directly, but my understanding is that the column would need to be changed from VARCHAR to NVARCHAR and that would cause a performance hit that doesn't seem necessary.
Given that I can modify the SQL, the PHP, or the JavaScript, is there an easy fix here? Maybe a way to unescape the HTML entity in JavaScript?
As answered by Yogesh, you should switch your collation of the DB to utf8_general_ci
So there's probably two things going on:
JSON escapes special characters.
Somewhere, something in your code flow is URL encoding the strings too.
So you just need to decode the string in your JavaScript, or you need to find what part of your code is URL encoding those strings and fix it.
I am working on a PHP/MySQL script that is inserting data into a database like this...
Caesar (courtesy post)
I know this is a basic question but how can I prevent the special characters from doing that?
It seems you're not just HTML-escaping your content once, but actually doing it twice. The first thing you should do is try to find out why your content ends up that way, instead of attempting to decode it to an unescaped format. You should always escape for the format you're going to use the data in, escape with the SQL escape functions when inserting, and escape with htmlspecialchars (or a similar function) when presenting the data in HTML (and take note of the character encoding used).
If the data comes in this format from another source, use html_entity_decode to normalize the text again. That does however seem weird.
As I prepare to tackle the issue of input data filtering and sanitization, I'm curious whether there's a best (or most used) practice? Is it better to filter/sanitize the data (of HTML, JavaScript, etc.) before inserting the data into the database, or should it be done when the data is being prepared for display in HTML?
A few notes:
I'm doing this in PHP, but I suspect the answer to this is language agnostic. But if you have any recommendations specific to PHP, please share!
This is not an issue of escaping the data for database insertion. I already have PDO handling that quite well.
Thanks!
When it comes to displaying user submitted data, the generally accepted mantra is to "Filter input, escape output."
I would recommend against escaping things like html entities, etc, before going into the database, because you never know when HTML will not be your display medium. Also, different types of situations require different types of output escaping. For example, embedding a string in Javascript requires different escaping than in HTML. Doing this before may lull yourself into a false sense of security.
So, the basic rule of thumb is, sanitize before use and specifically for that use; not pre-emptively.
(Please note, I am not talking about escaping output for SQL, just for display. Please still do escape data bound for an SQL string).
i like to have/store the data in original form.
i only escape/filter the data depending on the location where i'm using it.
on a webpage - encode all html
on sql - kill quotes
on url - urlencoding
on printers - encode escape commands
on what ever - encode it for that job
There are at least two types of filtering/sanitization you should care about :
SQL
HTML
Obviously, the first one has to be taken care of before/when inserting the data to the database, to prevent SQL Injections.
But you already know that, as you said, so I won't talk about it more.
The second one, on the other hand, is a more interesting question :
if your users must be able to edit their data, it is interesting to return it to them the same way they entered it at first ; which means you have to store a "non-html-specialchars-escaped" version.
if you want to have some HTML displayed, you'll maybe use something like HTMLPurifier : very powerful... But might require a bit too much resources if you are running it on every data when it has to be displayed...
So :
If you want to display some HTML, using a heavy tool to validate/filter it, I'd say you need to store an already filtered/whatever version into the database, to not destroy the server, re-creating it each time the data is displayed
but you also need to store the "original" version (see what I said before)
In that case, I'd probably store both versions into database, even if it takes more place... Or at least use some good caching mecanism, to not-recreate the clean version over and over again.
If you don't want to display any HTML, you will use htmlspecialchars or an equivalent, which is probably not that much of a CPU-eater... So it probably doesn't matter much
you still need to store the "original" version
but escaping when you are outputing the data might be OK.
BTW, the first solution is also nice if users are using something like bbcode/markdown/wiki when inputting the data, and you are rendering it in HTML...
At least, as long as it's displayed more often than it's updated -- and especially if you don't use any cache to store the clean HTML version.
Sanitize it for the database before putting it in the database, if necessary (i.e. if you're not using a database interactivity layer that handles that for you). Sanitize it for display before display.
Storing things in a presently unnecessary quoted form just causes too many problems.
I always say escape things immediately before passing them to the place they need to be escaped. Your database doesn't care about HTML, so escaping HTML before storing in the database is unnecessary. If you ever want to output as something other than HTML, or change which tags are allowed/disallowed, you might have a bit of work ahead of you. Also, it's easier to remember to do the escaping right when it needs to be done, than at some much earlier stage in the process.
It's also worth noting that HTML-escaped strings can be much longer than the original input. If I put a Japanese username in a registration form, the original string might only be 4 Unicode characters, but HTML escaping may convert it to a long string of "〹𐤲䡈穩". Then my 4-character username is too long for your database field, and gets stored as two Japanese characters plus half an escape code, which also probably prevents me from logging in.
Beware that browsers tend to escape some things like non-English text in submitted forms themselves, and there will always be that smartass who uses a Japanese username everywhere. So you may want to actually unescape HTML before storing.
Mostly it depends on what you are planning to do with the input, as well as your development environment.
In most cases you want original input. This way you get the power to tweak your output to your heart's content without fear of losing the original. This also allows you to troubleshoot issues such as broken output. You can always see how your filters are buggy or customer's input is erroneous.
On the other hand some short semantic data could be filtered immediately. 1) You don't want messy phone numbers in database, so for such things it could be good to sanitize. 2) You don't want some other programmer to accidentally output data without escaping, and you work in multiprogrammer environment. However, for most cases raw data is better IMO.
I want to prevent XSS attacks in my web application. I found that HTML Encoding the output can really prevent XSS attacks. Now the problem is that how do I HTML encode every single output in my application? I there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.
One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.
You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars
For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.
A nice way I used to escape all user input is by writing a modifier for smarty wich escapes all variables passed to the template; except for the ones that have |unescape attached to it. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)
My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason so encode anything is that even properties which you might assume to be boolean or numeric could contain malicious code (For example, checkbox values, if they're done improperly could be coming back as strings. If you're not encoding them before sending the output to the user, then you've got a vulnerability).
You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)
If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problem accepting external password containing < etc..
The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.
OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).
Output encoding is by far the best defense. Validating input is great for many reasons, but not 100% defense. If a database becomes infected with XSS via attack (i.e. ASPROX), mistake, or maliciousness input validation does nothing. Output encoding will still work.
there was a good essay from Joel on software (making wrong code look wrong I think, I'm on my phone otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everything. But by using they prefixes that infer a useful meaning looking at your code you'll see real quick if something isn't right. And you're going to need different encode/decode functions anyways.