I have a form text field that accepts a url. When the form is submitted, I insert this field into the database with proper anti-sql-injection. My question though is about xss.
This input field is a url and I need to display it again on the page. How do I protect it from xss on the way into the database (I think nothing is needed since I've already taken care of sql injection) and on the way out of the database?
Let's pretend we have it like this, I'm simplifying it, and please don't worry about sql injection. Where do I go from here after that?
$url = $_POST['url'];
Thanks
Assuming this is going to be put into HTML content (such as between <body> and </body> or between <div> and </div>), you need to encode the 5 special XML characters (&, <, >, ", '), and OWASP recommends including slash (/) as well. The PHP builtin, htmlentities() will do the first part for you, and a simple str_replace() can do the slash:
function makeHTMLSafe($string) {
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$string = str_replace('/', '&#x2F;', $string);
return $string;
}
If, however, you're going to be putting the tainted value into an HTML attribute, such as the href= clause of an <a> tag, then you'll need to encode a different set of characters ([space] % * + , - / ; < = > ^ and |), and you must double-quote your HTML attributes:
function makeHTMLAttributeSafe($string) {
$scaryCharacters = array(32, 37, 42, 43, 44, 45, 47, 59, 60, 61, 62, 94, 124);
$translationTable = array();
foreach ($scaryCharacters as $num) {
$hex = str_pad(dechex($num), 2, '0', STR_PAD_LEFT);
$translationTable[chr($num)] = '&#x' . $hex . ';';
}
$string = strtr($string, $translationTable);
return $string;
}
The final concern is illegal UTF-8 characters—when delivered to some browsers, an ill-formed UTF-8 byte sequence can break out of an HTML entity. To protect against this, simply ensure that all the UTF-8 characters you get are valid:
function assertValidUTF8($string) {
if (strlen($string) AND !preg_match('/^.{1}/us', $string)) {
die;
}
return $string;
}
The u modifier on that regular expression makes it a Unicode matching regex. PCRE validates the whole subject string before matching, so by matching a single character, ., we're assured that the entire string is valid Unicode.
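A non-fatal variant of the same check might look like the sketch below. The helper name isValidUtf8 is hypothetical, not part of the answer above; it relies on preg_match() returning false (rather than 0) when the subject is not valid UTF-8 under the u modifier, so the caller can decide how to react instead of calling die.

```php
<?php
// Hypothetical helper: returns true when $string is well-formed UTF-8.
function isValidUtf8(string $string): bool
{
    // With the u modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8. The empty pattern matches any valid
    // string, including the empty string.
    return preg_match('//u', $string) === 1;
}

// Usage: reject the request instead of terminating the script.
$candidate = "\xC3\x28"; // ill-formed: 0xC3 must be followed by a continuation byte
if (!isValidUtf8($candidate)) {
    echo "Rejected: input is not valid UTF-8\n";
}
```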
Since this is all context-dependent, it's best to do any of this encoding at the latest possible moment—just before presenting output to the user. Being in this practice also makes it easy to see any places you've missed.
OWASP provides a great deal of information on their XSS prevention cheat sheet.
You need to encode it with htmlspecialchars before displaying to a user. Usually this is enough when dealing with data outside of <script> tag and/or HTML tag attributes.
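For the URL case in the question, a minimal sketch could look like this. The function name renderUserLink and the http/https allowlist are assumptions for illustration. The scheme check is there because htmlspecialchars alone does not stop a javascript: URL from ending up in an href.

```php
<?php
// Hypothetical helper: returns an <a> element for $url, or null when the
// URL does not use an allowed scheme.
function renderUserLink(string $url): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null; // rejects javascript:, data:, scheme-less input, etc.
    }
    // Encode at output time, for the HTML context (attribute and text alike).
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '">' . $safe . '</a>';
}

$link = renderUserLink('http://example.com/?a=1&b=2');
echo $link ?? 'Invalid URL';
```

The raw URL goes into the database unchanged; only the rendered copy is encoded.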
Don't roll your own XSS protection; there are too many ways something might slip through (I can't find the link to a certain XSS demo page anymore, but the number of possibilities is staggering: broken IMG tags, weird attributes, etc.).
Use an existing library like sseq-lib or extract one from an established framework.
Update: Here's the XSS-demopage.
I have written some code to match and parse a Markdown link of this style:
[click to view a flower](http://www.yahoo.com/flower.html)
I have this code that is meant to extract the link text, then the URL itself, then stick them in an A HREF link. I am worried, though, that I may be missing a way for someone to inject XSS, because I am leaving in a decent number of characters. Is this safe?
$pattern_square = '\[(.*?)\]';
$pattern_round = "\((.*?)\)";
$pattern = "/".$pattern_square.$pattern_round."/";
preg_match($pattern, $input, $matches);
$words = $matches[1];
$url = $matches[2];
$words = ereg_replace("[^-_#0-9a-zA-Z\.]", "", $words);
$url = ereg_replace("[^-A-Za-z0-9+&@#/%?=~_|!:.]","",$url);
$final = "<a href='$url'>$words</a>";
It seems to work okay, and it does exclude some stupid URLs that include semicolons and backslashes, but I don't care about those URLs.
If you have already passed the input through htmlspecialchars (which you are doing, right?) then it is already impossible for the links to contain any characters that could cause XSS.
If you have not already passed the input through htmlspecialchars, then it doesn't matter what filtering you do when parsing the links, because you're already screwed, because one can trivially include arbitrary HTML or XSS outside the links.
This function will safely parse Markdown links in text while applying htmlspecialchars on it:
function doMarkdownLinks($s) {
return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
return '<a href="' . $matches[2] . '">' . $matches[1] . '</a>';
}, htmlspecialchars($s));
}
If you need to do anything more complicated than that, I advise you to use an existing parser, because it is too easy to make a mistake with this sort of thing.
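One gap worth illustrating: HTML-escaping does not touch the URL scheme, so a link target such as javascript:alert(1) would still pass through. The sketch below is a hypothetical variant (the name doMarkdownLinksChecked and the http/https restriction are assumptions, not part of the answer) that only turns http(s) targets into links.

```php
<?php
// Hypothetical variant: only http(s) targets become links.
function doMarkdownLinksChecked(string $s): string
{
    $escaped = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    $result = preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function (array $m): string {
        // $m[1] and $m[2] are already HTML-escaped at this point.
        if (!preg_match('#^https?://#i', $m[2])) {
            return $m[0]; // leave anything else as plain (escaped) text
        }
        return '<a href="' . $m[2] . '">' . $m[1] . '</a>';
    }, $escaped);

    // preg_replace_callback() returns null on a regex error; fall back to the escaped text.
    return $result ?? $escaped;
}

echo doMarkdownLinksChecked('[click to view a flower](http://www.yahoo.com/flower.html)');
```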
I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from database
for each of them. But before that, I have to send the user to a URL just for that category.
Here is the problem, if I did something like this.
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/category/Construction%20/%20Real%20Estate
I am trying to remove the %20 and make them / So, I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/category/Construction///Real/Estate
^ ^ and ^ those are the problems.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this with at least 10 lines of code, but I was hoping there is a regex or a simple PHP way to fix it.
You have a string for human consumption, and based on that string you want to create a URL.
To prevent any characters from messing up your HTML, or being abused for an XSS attack, you need to escape the human-readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>";
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
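Put together, the steps above can be sketched as follows. The helper name categorySlug is hypothetical, and it goes slightly beyond the strtr() call by collapsing every run of non-alphanumeric characters into a single dash, so " / " does not turn into three dashes. It only handles ASCII letters; names with other alphabets would need mb_strtolower() and a different character class.

```php
<?php
// Hypothetical helper: turns a human-readable category name into a URL segment.
function categorySlug(string $name): string
{
    $slug = strtolower($name);
    // Any run of characters other than a-z and 0-9 becomes one dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

$name = 'Construction / Real Estate';

// Convert first, then URL-encode, then HTML-encode at output time.
$url = 'site.com/category/' . rawurlencode(categorySlug($name));
echo '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
    . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</a>';
```

Because the conversion is lossy, the page behind the URL would have to look the category up by its slug (for example via a stored slug column) rather than try to reverse it.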
In fact, using htmlspecialchars() is not really enough. Escaping output that goes into an HTML attribute differs from escaping output that goes into an element's content. If you have a look at the Escaper class from Zend Framework 2, you'll realize that escaping an HTML attribute value is a lot more complicated.
No, there is nothing you can do to make it easier. The only chance is to use a function that does everything you need to make things easier for you, but you still need to apply the correct escaping everywhere.
You can use a simple solution like this:
$s = "site.com/category/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/category/Construction/RealEstate
Perhaps, you want to use urldecode() and remove the whitespace afterwards?
I posted this question a while back and it is working great for finding and 'linkifying' links from user generated posts.
Linkify Regex Function PHP Daring Fireball Method
<?php
if (!function_exists("html")) {
function html($string){
return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
}
if ( false === function_exists('linkify') ):
function linkify($str) {
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = $matches[2] == 'http' ? $input : "http://$input";
return '<a href="' . $url . '">' . "$input</a>";
}, $str);
}
endif;
echo "<div>" . linkify(html($row_rsgetpost['userinput'])) . "</div>";
?>
I am concerned that I may be introducing a security risk by inserting user generated content into a link. I am already escaping user content coming from my database with htmlspecialchars($string, ENT_QUOTES, 'UTF-8') before running it through the linkify function and echoing back to the page, but I've read on OWASP that link attributes need to be treated specially to mitigate XSS. I am thinking this function is ok since it places the user-generated content inside double quotes and has already been escaped with htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), but would really appreciate someone with xss expertise to confirm this. Thanks!
First off, data must NEVER be escaped before entering the database; this is a very serious mistake. It is not only insecure, it also breaks functionality. Changing the values of strings is data corruption and affects string comparison. This approach is insecure because XSS is an output problem: when you are inserting data into the database, you do not know where it will appear on the page. For instance, even if you use this function, the following code is still vulnerable to XSS:
<a href="javascript:alert(1)" \>
As for your regular expression, my initial reaction was: well, this is a horrible idea. There are no comments on how it's supposed to work, and it makes heavy use of NOT operators; blacklisting is always worse than whitelisting.
So I loaded up Regex Buddy and in about 3 minutes I bypassed your regex with this input:
https://test.com/test'onclick='alert(1);//
No developer wants to write a vulnerability, so vulnerabilities are caused by a gap between how the programmer thinks the application works and how it actually works. In this case I would assume you never tested this regex, and it's a gross oversimplification of the problem.
HTMLPurifier is a PHP library designed to clean HTML; it consists of THOUSANDS of regular expressions. It's very slow, and is bypassed on a fairly regular basis. So if you go this route, make sure to update regularly.
In terms of fixing this flaw, I think you're best off using htmlspecialchars($string, ENT_QUOTES, 'UTF-8') and then enforcing that the string starts with 'http'. HTML encoding is a form of escaping, and the browser automatically decodes the value, so the URL is unmolested.
Because the data is going into an attribute, it should be url (or percent) encoded:
return '<a href="' . urlencode($url) . '">' . "$input</a>";
Technically it should also then be html encoded
return '<a href="' . htmlspecialchars(urlencode($url)) . '">' . "$input</a>";
but no browsers I know of care and consequently no-one does it, and it sounds like you might be doing this step already and you don't want to do this twice
Your regular expression is looking for URLs that are http or https. That expression seems to be relatively safe, in that it does not detect anything that is not a URL.
The XSS vulnerability comes from how the URL is escaped as an HTML attribute value. That means making sure that the URL cannot prematurely break out of the attribute string and then add extra attributes to the HTML tag, as @Rook has been mentioning.
So I cannot really think of a way an XSS attack could be performed against the following code, as suggested by @tobyodavies, but without urlencode, which does something else:
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = $matches[2] == 'http' ? $input : "http://$input";
return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">' . "$input</a>";
}, $str);
Note that I have also added a small shortcut for checking the http prefix.
Now the anchor links that you generate are safe.
However you should also sanitize the rest of the text. I suppose that you don't want to allow any html at all and display all the html as clear text.
Firstly, as the PHP documentation states, htmlspecialchars only escapes
" '&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
". javascript: is still used in regular programming, so why : isn't escaped is beyond me.
Secondly, if !html only expects the characters you think will be entered, not the representations of those characters that can be entered and are seen as valid. The UTF-8 character set, and every other character set, supports multiple representations for the same character. Also, your false statement allows 0-9 and a-z, so you still have to worry about base64 characters. I'd call your code a good attempt, but it needs a ton of refining. That, or you could just use HTMLPurifier, which people can still bypass. I do think that it is awesome that you set the character set in htmlspecialchars, since most programmers don't understand why they should do that.
Well, the title is my question. Can anybody give me a list of things to do to sanitize my data before entering to mysql database using php, especially if the data contains html tags?
It depends on a lot of things. If you don't want to accept any HTML, that makes it a whole lot easier, run it through strip_tags() first to remove all the HTML from it. After that it's much safer. If you do want to accept some HTML, you can selectively keep some tags from it with the same function, just add in the tags to keep after. eg: strip_tags($string_to_sanitize, '<p><div>'); // Keeps only <p> and <div> tags.
As for inserting into a database, it's always best to sanitize anything before inserting into the database; adopting a "don't trust anybody" mentality will save you a lot of trouble. Preventing against SQL injection is fairly straightforward, this is the method I use:
$q = sprintf("INSERT INTO table_name (string_field, int_field) VALUES ('%s', %d);",
mysql_real_escape_string($values['string']),
mysql_real_escape_string($values['number']));
$result = mysql_query($q, $connection);
Generally once you open the door for allowing HTML in, you'll have a whole deal of things to worry about (there are some great articles on defending from XSS out there). If you want to test for XSS vulnerabilities, try the examples on http://ha.ckers.org/xss.html. There are some they have there that you would probably never even consider, so give it a look!
Also, if you are accepting specific types of input (eg: numbers, emails, boolean values) try using the inbuilt filter_var() function in PHP. They have a bunch of inbuilt types to validate data against (http://www.php.net/manual/en/filter.filters.validate.php), as well as a number of filters to sanitize your data (http://www.php.net/manual/en/filter.filters.sanitize.php).
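As a rough illustration of the validation filters mentioned above (the sample values are made up), each FILTER_VALIDATE_* call returns the converted value on success and false on failure:

```php
<?php
// Validation filters return the (converted) value on success, false on failure.
$age   = filter_var('42', FILTER_VALIDATE_INT);
$bad   = filter_var('42abc', FILTER_VALIDATE_INT);
$email = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);
$url   = filter_var('http://example.com/path', FILTER_VALIDATE_URL);

// For booleans, FILTER_NULL_ON_FAILURE distinguishes a real "false" from invalid input.
$flag = filter_var('yes', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
$junk = filter_var('maybe', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);

var_dump($age, $bad, $email, $url, $flag, $junk);
```

Note that FILTER_VALIDATE_URL checks syntax rather than whether the scheme is safe to link to, so a separate scheme check is still needed before putting the value in an href.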
Generally, accepting any input is like opening a Pandora's Box, and while you'll probably never be able to block 100% of the weaknesses (people are always looking to find a way in), you can block the common ones to save you headaches.
Finally remember to sanitize ALL external data. Just because you make a dropdown input doesn't mean some shady person can't send their own data instead!
Use mysql_real_escape_string();
mysql_query("INSERT INTO table(col) VALUES('" . mysql_real_escape_string($_POST['data']) . "')");
You should use prepared statements when inserting data into the database, not any sort of escaping. (PHP manual: prepared statements in pdo and mysqli.)
Sanitization for HTML output should, as mentioned by others, happen when you go to take data out of the database and merge it into a page, not before.
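A minimal sketch of that split, assuming PDO with the pdo_sqlite extension (an in-memory SQLite database stands in for MySQL here so the example is self-contained; the links table is hypothetical):

```php
<?php
// Assumption: pdo_sqlite is available. With MySQL only the DSN would differ.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT)'); // hypothetical table

$input = "http://example.com/?q='; DROP TABLE links; --";

// The value is bound as a parameter, so it is never parsed as SQL.
$stmt = $pdo->prepare('INSERT INTO links (url) VALUES (:url)');
$stmt->execute([':url' => $input]);

// The database holds the raw, unmodified string.
$stored = $pdo->query('SELECT url FROM links')->fetchColumn();

// HTML encoding happens only here, when the value is merged into a page.
echo htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
```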
Turn off register_globals and magic_quotes, use mysql_real_escape_string on any string coming from the user before placing it into your query.
Of course mysql_real_escape_string
When dealing with any kind of input, start from the "I won't allow anything" standpoint and whitelist only what is deemed acceptable.
On insert you need to make sure that the data is MySQL-escaped. For this, use mysql_real_escape_string.
Before showing the data you will need to strip out unsafe HTML and/or JavaScript code. Many people choose to store the sanitised version in the database. Others prefer to strip the ugly HTML from the string before rendering.
You do this in PHP with some filtering. An example is the Drupal filter_xss function:
function filter_xss($string, $allowed_tags = array('a', 'em', 'strong', 'cite', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd')) {
// Only operate on valid UTF-8 strings. This is necessary to prevent cross
// site scripting issues on Internet Explorer 6.
if (!drupal_validate_utf8($string)) {
return '';
}
// Store the input format
_filter_xss_split($allowed_tags, TRUE);
// Remove NUL characters (ignored by some browsers)
$string = str_replace(chr(0), '', $string);
// Remove Netscape 4 JS entities
$string = preg_replace('%&\s*\{[^}]*(\}\s*;?|$)%', '', $string);
// Defuse all HTML entities
$string = str_replace('&', '&amp;', $string);
// Change back only well-formed entities in our whitelist
// Decimal numeric entities
$string = preg_replace('/&amp;#([0-9]+;)/', '&#\1', $string);
// Hexadecimal numeric entities
$string = preg_replace('/&amp;#[Xx]0*((?:[0-9A-Fa-f]{2})+;)/', '&#x\1', $string);
// Named entities
$string = preg_replace('/&amp;([A-Za-z][A-Za-z0-9]*;)/', '&\1', $string);
return preg_replace_callback('%
(
<(?=[^a-zA-Z!/]) # a lone <
| # or
<!--.*?--> # a comment
| # or
<[^>]*(>|$) # a string that starts with a <, up until the > or the end of the string
| # or
> # just a >
)%x', '_filter_xss_split', $string);
}
Well, there is not too much to do when we're talking about inserting data from a textarea into a MySQL database.
For strings placed into a query, MySQL's requirements are not so complicated.
Only 2 rules to follow:
inserted data should be surrounded by quotes.
some special character in the data should be escaped.
Note that this operation has nothing to do with security. It's a syntax requirement.
Assuming you're adding quotes already, the only thing you have to add is escaping. Depending on your encoding, you can use the addslashes, mysql_escape_string or mysql_real_escape_string functions.
However, other parts of query require more attention. If you're curious, refer to my earlier answer with complete guide: In PHP when submitting strings to the database should I take care of illegal characters using htmlspecialchars() or use a regular expression?
HTML tags have nothing to do with the database and require no special attention.
However, for displaying data from an untrusted source, some precautions should be taken. This was described in this topic already; the only thing I have to add is that you can't trust strip_tags when it is used with the second parameter.
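To see why, here is a small sketch with a made-up input: strip_tags() with an allow-list keeps the allowed tag together with all of its attributes, including event handlers.

```php
<?php
$input = '<p onclick="alert(1)">hi</p><script>x</script>';

// <p> is allowed, so the tag survives along with its onclick attribute.
$stripped = strip_tags($input, '<p>');

// Escaping the whole string instead leaves no live markup at all.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n", $escaped, "\n";
```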
You can use mysql_real_escape_string, you can also use htmlentities with addslashes... or you can use all 3 together also...
I have a website where I can't use htmlentities() or htmlspecialchars() to process user input data. Instead, I added a custom function that uses an array $forbidden to clean the input string of all unwanted characters. At the moment I have '<', '>', "'" as unwanted characters because of SQL injection/browser hijacking. My site is encoded in UTF-8 - do I have to add more characters to that array, i.e. the character '<' encoded in other charsets?
Thanks for any help,
Maenny
Neither htmlentities nor htmlspecialchars has anything to do with SQL injection.
To prevent injection, you have to follow some rules; I've described them all here.
To filter HTML you may use the htmlspecialchars() function; it will harm none of your Cyrillic characters.
You should escape ", too. It is much more harmful than ', because you often enclose HTML attributes in ". But why don't you simply use htmlspecialchars to do that job?
Furthermore: it isn't good to use one escaping function for both SQL and HTML. HTML needs escaping of tags, whereas SQL does not. So it would be best if you used htmlspecialchars for HTML output and PDO::quote (or mysql_real_escape_string or whatever you are using) for SQL queries.
But I know (from my own experience) that escaping all user input in SQL queries may be really annoying and sometimes I simply don't escape parts, because I think they are "secure". But I am sure I'm not always right, about assuming that. So, in the end I wanted to ensure that I really escape all variables used in an SQL query and therefore have written a little class to do this easily: http://github.com/nikic/DB Maybe you want to use something similar, too.
Put this code into your header page. Note that it cuts a header value off at the first CR or LF, which guards against header injection rather than SQL injection.
function clean_header($string)
{
$string = trim($string);
// From RFC 822: "The field-body may be composed of any ASCII
// characters, except CR or LF."
if (strpos($string, "\n") !== false) {
$string = substr($string, 0, strpos($string, "\n"));
}
if (strpos($string, "\r") !== false) {
$string = substr($string, 0, strpos($string, "\r"));
}
return $string;
}