I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as â€“.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
OR, is there something I should be doing (like more escaping) when reading data back out from the MySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!
Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing it to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: Your input data seems to be encoded with UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded as 0xE28093 in UTF-8, and those three bytes would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
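The byte-level claim above is easy to check for yourself. This is only a minimal sketch, assuming PHP with the iconv extension and an iconv build that knows the name CP1250; the variable names are made up for illustration.

```php
<?php
// U+2013 (en dash) written out as its three UTF-8 bytes.
$dash = "\xE2\x80\x93";
$hex = bin2hex($dash); // "e28093"

// Misread those same three bytes as Windows-1250 and convert each one to UTF-8.
// This imitates what a page does when it displays UTF-8 data as Windows-1250.
$misread = iconv('CP1250', 'UTF-8', $dash);

echo $hex, "\n";
echo $misread, "\n"; // the three characters â, € and “
```

Nothing in the data is damaged at this point; the same bytes are just being read with the wrong encoding.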
Sanitization is a tricky subject, because it means different things depending on the context. :)
real_escape_string just makes sure your data can be included in a request (inside quotes, of course) without having the possibility to change the "meaning" of the request.
The manual page explains what the function really does: it escapes NUL characters, line feeds, carriage returns, backslashes, single quotes, double quotes, and "Control-Z" (ASCII 26, the SUBSTITUTE character). So it just replaces each of those characters with a backslash escape sequence.
That's it. It "sanitizes" the string so it can be passed unchanged in a request. But it doesn't sanitize it under any other point of view: users can still pass for instance HTML markers, or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
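As a rough illustration of escaping at output time, here is a small sketch. The helper name escapeForHtml and the sample headline are hypothetical; the only real API used is htmlspecialchars with the ENT_QUOTES flag and an explicit UTF-8 charset.

```php
<?php
// Hypothetical helper: escape a raw value at the moment it is written into HTML.
function escapeForHtml(string $value): string
{
    // ENT_QUOTES also escapes single quotes, which matters inside attributes.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Raw text as it might come back from the database.
$headline = '<script>alert("x")</script> Tom & Jerry';
$safe = escapeForHtml($headline);

echo '<h1>', $safe, "</h1>\n";
```

The value stored in the database stays raw; only the copy sent to the browser is escaped.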
After a few hours of bug searching, I found out the cause of one of my most annoying bugs.
When users are typing out a message on my site, they can title it with plaintext and html entities.
This means in some instances, users will type a title with common html entity pictures like this face. ( ͡° ͜ʖ ͡°).
To prevent HTML injection, I use htmlspecialchars() on the title, and annoyingly it would convert the picture to its HTML entity format when outputted onto the page later on.
( &#865;&#176; &#860;&#662; &#865;&#176;)
I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars, as well as doing what I wanted and encoding possible HTML injection, was turning the ampersand in the entities into &amp;.
By un-escaping all the ampersands, changing them back to &, I fixed my problem and the face would come out as expected.
However, I am unsure if this is still safe from malicious HTML. Is it safe to decode the ampersands in user-inputted titles? If not, how can I go about fixing this issue?
If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.
If you are not calling htmlspecialchars() twice explicitly, then it's probably a browser-side auto-escaping that may occur if the page containing the form is using an obsolete single-byte encoding like Windows-1252. Such automatic escaping is the only way to correctly represent characters not present in character set of the specific single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.
Make sure you are using Unicode (UTF-8 in particular) encoding.
To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form. And don't forget to save the HTML page itself in UTF-8 encoding. To use Unicode in PHP, it's typically enough to use multibyte (mb_ prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.
As a temporary workaround, you can disable re-encoding of existing entities by setting the 4th parameter ($double_encode) of the htmlspecialchars() function to false.
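To see what the $double_encode parameter changes, here is a small sketch with a made-up title containing one numeric entity and one tag:

```php
<?php
// Made-up title: an existing numeric entity followed by markup.
$title = '&#865; <b>hi</b>';

// Default behaviour: the ampersand of the existing entity is escaped again.
$default = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// With $double_encode = false, existing entities are left alone,
// but < and > are still escaped.
$noDouble = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false);

echo $default, "\n";  // &amp;#865; &lt;b&gt;hi&lt;/b&gt;
echo $noDouble, "\n"; // &#865; &lt;b&gt;hi&lt;/b&gt;
```

One trade-off to keep in mind: with the parameter set to false, a user who literally types &amp; will see & on the page, so this is a workaround rather than a fix for the encoding problem.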
There is no straight answer. You may unescape &lt;script...&gt; into <script...> and end up in trouble; however, it looks like the code has been double encoded - probably once on input and then again when you output to screen. If you can guarantee it has been double encoded, then it should be safe to undo one of those.
However, the best solution is to keep the "raw" value in memory, and sanitize/encode for outputting into databases, html, JSON etc.
So - when you get input, sanitise it for anything you don't want, but don't actually convert it into HTML or escape it or anything else at this stage. Escape it into a database, html encode it when output to screen / xml etc.
I am working on a PHP/MySQL script that is inserting data into a database like this...
Caesar (courtesy post)
I know this is a basic question but how can I prevent the special characters from doing that?
It seems you're not just HTML-escaping your content once, but actually doing it twice. The first thing you should do is try to find out why your content ends up that way, instead of attempting to decode it to an unescaped format. You should always escape for the format you're going to use the data in: escape with the SQL escape functions when inserting, and escape with htmlspecialchars (or a similar function) when presenting the data in HTML (and take note of the character encoding used).
If the data comes in this format from another source, use html_entity_decode to normalize the text again. That does however seem weird.
I'm creating a page in which users enter comments, and those comments are inserted into a MySQL DB. These comments can contain single quotes, double quotes, or any special chars. To escape these I used the following code
$str = mysql_real_escape_string($str,$conn);
Here $conn is the active connection resource and $str is the string content from the textarea.
This works fine and returns a perfectly escaped string that I can insert into the DB. But if the user typed the comment in a text editor like OpenOffice Writer or MS Word and pasted the text from there, inserting into the DB fails with the following error
Incorrect string value: '\x93testi...' for column 'commnets' at row 1
I think this is happening because the single and double quotes in text coming from a text editor (OpenOffice, MS Word) are not escaped properly. So how do I escape them to insert the text into the DB? Please help me.
Thanks in advance.....
You aren't submitting a valid UTF-8 string to be saved in the DB. Instead it's probably in a Windows-specific character set.
Presumably your users are submitting the text through a web page - you need to make sure that you serve the page in UTF-8 and that the form is also submitted in UTF-8 (which it will be by default if the page is served in UTF-8).
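If you want to confirm that diagnosis, or rescue text that has already arrived in the wrong encoding, something like the following sketch might help. The helper name toUtf8 is hypothetical, and it assumes that anything which is not valid UTF-8 was sent as Windows-1252, which is only a guess about your users' browsers.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function toUtf8(string $text): string
{
    // With the /u modifier, preg_match() fails on an invalid UTF-8 subject.
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid UTF-8, leave it alone
    }
    $converted = iconv('CP1252', 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException('Input is not Windows-1252 either');
    }
    return $converted;
}

// 0x93 and 0x94 are the curly double quotes in Windows-1252,
// the same leading byte as in the error message above.
$fromWord = "\x93testing\x94";
$fixed = toUtf8($fromWord);

echo bin2hex($fixed), "\n";
```

Serving the form page as UTF-8 is still the real fix; converting afterwards is guesswork about what the browser sent.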
You need to:
Make sure you're sending the UTF-8 charset in the headers.
header("Content-Type:text/html; charset=UTF-8");
And/or set the content type in the <head> section of your page, for example with a <meta charset="utf-8" /> element.
By the way, mysql_real_escape_string doesn't really have anything to do with the problem here. That function is used to prevent strings containing normal quotes from being used for SQL injection attacks, which is better solved by using prepared statements anyway.
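For reference, a prepared statement looks roughly like the sketch below. To keep it runnable without a server it uses PDO with an in-memory SQLite database instead of MySQL (that swap is my assumption, as are the table and column names); with MySQL you would connect through a mysql: DSN and the prepare/execute part should look the same.

```php
<?php
// Assumption: in-memory SQLite stands in for MySQL so this runs anywhere
// the pdo_sqlite driver is installed. Table and column names are made up.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)');

// A comment full of characters that would otherwise need escaping.
$comment = "It's a \"quoted\" comment; DROP TABLE comments; --";

// The value is bound as a parameter and never spliced into the SQL text.
$insert = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
$insert->execute([':body' => $comment]);

// Read it back: the stored text is exactly what was submitted.
$stored = $pdo->query('SELECT body FROM comments')->fetchColumn();
var_dump($stored === $comment);
```

Note that this only addresses quoting and injection. The character encoding of the page and of the connection still has to be right.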
There is one way to sidestep all this real_escape malarkey and inject INTO sql what is actually supplied, and that is to use mysql's ability to interpret a hexadecimal number of arbitrary length as a string.
e.g.
$query=sprintf("update module set code=0x%s where id='%d'", bin2hex($code), $id);
This works even if code is a BLOB type binary field and $code is full binary data (e.g, an image file contents).
You will also sidestep any sql injection with this.
I have found that using sprintf to format queries is extremely powerful and safe and use of the php bin2hex() renders anything up to and including binary able to get into the database untainted.
Getting it out is somewhat another matter mind you..
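A quick sketch of what that query string ends up looking like, using a made-up value for $code that contains a quote:

```php
<?php
// Made-up values for illustration.
$code = "a'b"; // contains a quote that would normally need escaping
$id = 5;

// bin2hex() leaves only the characters 0-9 and a-f, so nothing in $code
// can terminate a string literal; %d forces $id to an integer.
$query = sprintf("update module set code=0x%s where id='%d'", bin2hex($code), $id);

// hex2bin() reverses bin2hex() on the PHP side.
$roundTrip = hex2bin(bin2hex($code));

echo $query, "\n"; // update module set code=0x612762 where id='5'
```

An empty $code would produce a bare 0x with no digits after it, so that case is worth handling separately before building the query.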
I am using codeigniter in an app. There is a form. In the textarea element, I wrote something including
%Features%
However, when I want to echo this by $this->input->post(key), I get something like
�atures%
The '%Fe' has vanished.
In main index.php file of CI, I tried var_dump($_POST) and I see the above word is fully ok. but when I am fetching it with the input library (xss filtering is on) I get the problem.
When the XSS filtering is off, it appears ok initially. however, if I store it in database and show next time, I see same problem (even the xss filtering is off).
%Fe happens to look like a URL-encoded sequence %FE, representing character 254. It's being munched into the Unicode "I have no idea what that sequence means" glyph, �.
It's clear that the "XSS filter" is being over-zealous when decoding the field on submission.
It's also very likely that a URL-decode is being run again at some point later in the process, when you output the result from the database. Check the database to make sure that the actual string is being represented properly.
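You can reproduce the effect outside CodeIgniter with plain PHP. This is only a sketch of the mechanism; I'm not claiming it is exactly what the input library does internally.

```php
<?php
// The text as typed into the textarea.
$typed = '%Features%';

// A URL-decode treats "%Fe" as the single byte 0xFE;
// the trailing "%" has no hex digits after it and is left alone.
$decoded = rawurldecode($typed);

// A lone 0xFE byte is not valid UTF-8, so it gets displayed
// as the replacement glyph.
$isValidUtf8 = preg_match('//u', $decoded) === 1;

echo bin2hex($decoded), "\n"; // fe61747572657325
var_dump($isValidUtf8);       // bool(false)
```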
First: Escape the variables before storing them into the db. % has special meaning in SQL (as a wildcard in LIKE patterns).
Second: % also has special meaning in URLs, e.g. %20 is a space, and %FE will map to some character, which will be decoded by input().
Using PHP against a UTF-8 compliant database. Here's how input goes in.
user types input into textarea
textarea encoded with javascript escape()
passed via HTTP post
decoded with PHP rawurldecode()
passed through HTMLPurifier with default settings
escaped for MySQL and stored in database
And it comes out in the usual way and I run unescape() on page load. This is to allow people to, say, copy and paste directly from a word document and have the smart quotes show up.
But HTMLPurifier seems to be clobbering non-UTF-8 special characters, ones that escape() to a simple % expression, like Ö, which escapes to %D6, whereas smartquotes escape to %u2024 or something and go into the database that way. It takes out both the special character and the one immediately following.
I need to change something in this process. Perhaps I need to change multiple things.
What can I do to not get special characters clobbered?
textarea encoded with javascript escape()
escape isn't safe for non-ASCII. Use encodeURIComponent.
passed via HTTP post
I assume that you use XMLHttpRequest? If not, make sure that the page containing the form is served as utf-8.
decoded with PHP rawurldecode()
If you access the value through $_POST, you should not decode it, since that has already been done. Doing so will mess up data.
escaped for MySQL and stored in database
Make sure you don't have magic quotes turned on. Make sure that the database stores tables as utf-8 (the encoding and the collation must both be utf-8). Make sure that the connection between PHP and MySQL is utf-8 (use set names utf8 if you don't use PDO).
Finally, make sure that the page is served as utf-8 when you output the string again.
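To make the escape() problem concrete, here is a small PHP-side sketch of what rawurldecode() does with the different forms. The percent-encoded strings are what JavaScript's escape() and encodeURIComponent() produce for Ö and for a curly quote; the variable names are made up.

```php
<?php
// escape() sends Ö as the single Latin-1 byte %D6.
$fromEscape = rawurldecode('%D6');

// encodeURIComponent() sends Ö as its two UTF-8 bytes.
$fromUtf8 = rawurldecode('%C3%96');

// escape() uses a non-standard %uXXXX form for characters above U+00FF;
// rawurldecode() does not understand it and leaves it untouched.
$untouched = rawurldecode('%u201C');

var_dump(preg_match('//u', $fromEscape) === 1); // bool(false): not valid UTF-8
var_dump(preg_match('//u', $fromUtf8) === 1);   // bool(true)
echo $untouched, "\n";                          // %u201C
```

That would explain the symptoms: the lone 0xD6 byte is invalid UTF-8, so a UTF-8 based filter may drop or mangle it, while the %uXXXX sequences pass through as plain ASCII.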