Mitigate xss attacks when building links - php

I posted this question a while back and it is working great for finding and 'linkifying' links from user generated posts.
Linkify Regex Function PHP Daring Fireball Method
<?php
if (!function_exists("html")) {
    function html($string){
        return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
    }
}
if ( false === function_exists('linkify') ):
    function linkify($str) {
        $pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
        return preg_replace_callback("#$pattern#i", function($matches) {
            $input = $matches[0];
            $url = $matches[2] == 'http' ? $input : "http://$input";
            return '<a href="' . $url . '">' . "$input</a>";
        }, $str);
    }
endif;
echo "<div>" . linkify(html($row_rsgetpost['userinput'])) . "</div>";
?>
I am concerned that I may be introducing a security risk by inserting user generated content into a link. I am already escaping user content coming from my database with htmlspecialchars($string, ENT_QUOTES, 'UTF-8') before running it through the linkify function and echoing back to the page, but I've read on OWASP that link attributes need to be treated specially to mitigate XSS. I am thinking this function is ok since it places the user-generated content inside double quotes and has already been escaped with htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), but would really appreciate someone with xss expertise to confirm this. Thanks!

First off, data must NEVER be escaped before entering the database; this is a very serious mistake. It is not only insecure, it also breaks functionality. Changing the values of strings is data corruption and affects string comparison. The approach is insecure because XSS is an output problem: when you are inserting data into the database, you do not know where it will appear on the page. For instance, even if you use this function, the following code is still vulnerable to XSS:
<a href="javascript:alert(1)" \>
As for your regular expression, my initial reaction was: well, this is a horrible idea. There are no comments on how it's supposed to work, and it makes heavy use of NOT operators; blacklisting is always worse than whitelisting.
So I loaded up Regex Buddy and in about 3 minutes I bypassed your regex with this input:
https://test.com/test'onclick='alert(1);//
No developer wants to write a vulnerability, so vulnerabilities are caused by a gap between how the programmer thinks the application works and how it actually works. In this case I would assume you never tested this regex, and it's a gross oversimplification of the problem.
HTML Purifier is a PHP library designed to clean HTML; it consists of THOUSANDS of regular expressions. It's very slow, and it is bypassed on a fairly regular basis. So if you go this route, make sure to update regularly.
In terms of fixing this flaw, I think you're best off using htmlspecialchars($string, ENT_QUOTES, 'UTF-8') and then enforcing that the string starts with 'http'. HTML encoding is a form of escaping, and the value will be automatically decoded by the browser, such that the URL is unmolested.
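A minimal sketch of that recommendation, assuming a hypothetical helper named safe_href() (it is not part of the question's code). It accepts only values that start with http:// or https:// and HTML-escapes whatever passes; anything else is rejected by returning null.

```php
<?php
// Hypothetical helper: returns an escaped href value, or null if the scheme is not http(s).
function safe_href(string $url): ?string
{
    // Whitelist the scheme; anything else (javascript:, data:, ...) is rejected.
    if (!preg_match('#^https?://#i', $url)) {
        return null;
    }
    // ENT_QUOTES escapes both quote styles so the value cannot break out of the attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

var_dump(safe_href('javascript:alert(1)'));
// prints: NULL

echo safe_href("https://test.com/test'onclick='alert(1);//"), "\n";
// prints: https://test.com/test&#039;onclick=&#039;alert(1);//
```

The scheme check is a whitelist, so it stays safe even for schemes nobody thought to block.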

Because the data is going into an attribute, it should be URL (or percent) encoded:
return '<a href="' . urlencode($url) . '">' . "$input</a>";
Technically it should also then be HTML encoded:
return '<a href="' . htmlspecialchars(urlencode($url)) . '">' . "$input</a>";
but no browsers I know of care, and consequently no one does it. It also sounds like you might be doing this step already, and you don't want to do it twice.

Your regular expression is looking for URLs that are http or https. That expression seems to be relatively safe, as it does not detect anything that is not a URL.
The XSS vulnerability comes from the escaping of the URL as an HTML attribute value. That means making sure that the URL cannot prematurely terminate the attribute string and then add extra attributes to the HTML tag, as @Rook has mentioned.
So I cannot really think of a way an XSS attack could be performed against the following code, which is as suggested by @tobyodavies but without urlencode, which does something else:
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
    $input = $matches[0];
    $url = $matches[2] == 'http' ? $input : "http://$input";
    return '<a href="' . htmlspecialchars($url) . '">' . "$input</a>";
}, $str);
Note that I have also added a small shortcut for checking the http prefix.
Now the anchor links that you generate are safe.
However you should also sanitize the rest of the text. I suppose that you don't want to allow any html at all and display all the html as clear text.
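One way to cover both the links and the rest of the text, sketched under assumptions: work on the raw text, split it into URL and non-URL pieces, and escape each piece as it is emitted. The function name linkify_escaped() and the deliberately simplified pattern (http and https URLs only) are placeholders for illustration, not the pattern used above.

```php
<?php
// Hypothetical sketch: escape everything, and wrap only http(s) URLs in anchors.
function linkify_escaped(string $text): string
{
    // Simplified pattern for illustration; it only recognises http:// and https:// URLs.
    $parts = preg_split('#(https?://[^\s<>"\']+)#i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        // With PREG_SPLIT_DELIM_CAPTURE, odd indexes hold the captured URLs.
        $out .= ($i % 2 === 1)
            ? '<a href="' . $escaped . '">' . $escaped . '</a>'
            : $escaped;
    }
    return $out;
}

echo linkify_escaped('See http://example.com/?a=1&b=2 <script>alert(1)</script>'), "\n";
// prints: See <a href="http://example.com/?a=1&amp;b=2">http://example.com/?a=1&amp;b=2</a> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because escaping happens inside the function, it should be fed the raw text, not text that has already been through htmlspecialchars().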

Firstly, as the PHP documentation states, htmlspecialchars only escapes:
'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
javascript: is still used in regular programming, so why : isn't escaped is beyond me.
Secondly, your html() function only expects the characters you think will be entered, not the other representations of those characters that can be entered and are seen as valid. UTF-8, like every other character set, supports multiple representations for the same character. Also, the pattern in your false === block allows 0-9 and a-z, so you still have to worry about base64 characters. I'd call your code a good attempt, but it needs a ton of refining. That, or you could just use HTML Purifier, which people can still bypass. I do think that it is awesome that you set the character set in htmlspecialchars, since most programmers don't understand why they should do that.
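The translations quoted from the PHP documentation can be checked directly. This small sketch prints what htmlspecialchars() does with ENT_QUOTES set, and shows that a javascript: string passes through untouched because the colon is not an HTML special character.

```php
<?php
// Show what htmlspecialchars() translates with ENT_QUOTES, and what it leaves alone.
echo htmlspecialchars('& " \' < >', ENT_QUOTES, 'UTF-8'), "\n";
// prints: &amp; &quot; &#039; &lt; &gt;

// The colon is not an HTML special character, so this comes back unchanged.
echo htmlspecialchars('javascript:alert(1)', ENT_QUOTES, 'UTF-8'), "\n";
// prints: javascript:alert(1)
```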

Related

Only run HTML (PHP)

I'm here with a question on a project; I try to explain as best as possible:
I have a text area in which the user can write whatever they want.
The problem is that they can try for some kind of malicious code (js xss, for example)
I was using the function:
echo htmlspecialchars($topic->getMessage(), ENT_QUOTES, 'UTF-8');
I thought I had solved the problem, but I remembered that the user can type HTML, and it is allowed.
Is there an existing function that lets the HTML run while everything else stays as text?
As per PHP manual, htmlspecialchars performs the following translations:
'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
Your HTML actually does get translated into safe characters.
After reading your question again (for it's not very clear), I thought maybe you want the HTML tags to actually stay as HTML tags, meaning <b>bold</b> wouldn't get translated into &lt;b&gt;bold&lt;/b&gt;.
To do so, you may want to use str_replace after htmlspecialchars:
$result = htmlspecialchars($topic->getMessage(), ENT_QUOTES, 'UTF-8');
$result = str_replace(array("&lt;","&gt;"), array("<",">"), $result);
echo $result;
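Translating every &lt; and &gt; back restores all tags, including script tags. A narrower sketch, assuming only a handful of attribute-free tags should stay live; the function name allow_basic_tags() and the tag list are illustrative choices, not something from the question:

```php
<?php
// Hypothetical helper: escape everything, then re-enable a small whitelist of bare tags.
function allow_basic_tags(string $text, array $tags = ['b', 'i', 'em', 'strong']): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // Build an alternation such as b|i|em|strong from the whitelist.
    $alternation = implode('|', array_map('preg_quote', $tags));
    // Only exact, attribute-free forms like &lt;b&gt; or &lt;/b&gt; are turned back into tags.
    return preg_replace('#&lt;(/?)(' . $alternation . ')&gt;#i', '<$1$2>', $escaped);
}

echo allow_basic_tags('<b>bold</b> <script>alert(1)</script>'), "\n";
// prints: <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Tags carrying attributes (for example an onclick handler) do not match the pattern and therefore stay escaped. For anything richer than this, a dedicated library such as HTML Purifier is the more realistic route.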
Or you could just translate &, ' (single quote) and " (double quote) via str_replace:
echo str_replace(array("&", "\"", "'"), array("&amp;", "&quot;", "&#039;"), $topic->getMessage());
Possibilities are endless.
htmlspecialchars is OK for output, but it does not make data safe to insert into MySQL.
For mysql it's better to use prepared statements, such as explained here:
http://bobby-tables.com/php.html
For output in the page (without inserting on database), htmlspecialchars is enough... provided you don't decode those before printing.
Like CBroe suggested, You could use http://htmlpurifier.org/ to clean the html and avoid garbage in your database, but you still must use prepared statements.
Also read: http://php.net/manual/en/pdo.prepared-statements.php
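A short sketch of that split between storage and output, using an in-memory SQLite database so it runs without a server. It assumes the pdo_sqlite extension is available; the posts table and message column are made-up names for the example.

```php
<?php
// Sketch with an in-memory SQLite database (requires pdo_sqlite); table and column names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (id INTEGER PRIMARY KEY, message TEXT)');

// Store the raw message; the placeholder keeps it out of the SQL text.
$message = "<b>Robert'); DROP TABLE posts;--</b>";
$insert = $pdo->prepare('INSERT INTO posts (message) VALUES (:message)');
$insert->execute([':message' => $message]);

// Escape only at output time, for the HTML context.
$stored = $pdo->query('SELECT message FROM posts')->fetchColumn();
echo htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'), "\n";
```

The value comes back from the database byte for byte as it was entered, and only the echo decides how it is escaped.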

converting url separators with slash

I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from database
for each of them. But before that I have to send the user to a URL just for that category.
Here is the problem, if I did something like this.
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/category/Construction%20/%20Real%20Estate
I am trying to remove the %20 and make them / So, I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/category/Construction///Real/Estate
^ ^ and ^ those are the problems.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this using at least 10 lines of code, but I was hoping there is a regex or a simple PHP way to fix it.
You have a string for human consumption, and based on that string you want to create a URL.
To avoid any characters messing up your HTML, or being abused for an XSS attack, you need to escape the human-readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>";
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
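Put together, the three steps might look like the sketch below. The helper name category_link() is made up, and it collapses each run of spaces and slashes into a single dash (a small variation on the strtr() call, which would produce three dashes for " / "). It relies on the mbstring extension, as mb_strtolower() already does.

```php
<?php
// Hypothetical helper combining the steps: slug first, then URL-encode, then HTML-escape.
function category_link(string $name): string
{
    // Collapse any run of spaces and slashes into a single dash, then lowercase.
    $slug = mb_strtolower(preg_replace('#[\s/]+#', '-', trim($name)), 'UTF-8');
    $url  = 'site.com/category/' . rawurlencode($slug);

    return "<a href='" . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . "'>"
         . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo category_link('Construction / Real Estate'), "\n";
// prints: <a href='site.com/category/construction-real-estate'>Construction / Real Estate</a>
```

ENT_QUOTES matters here because the attribute is wrapped in single quotes.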
In fact, using htmlspecialchars() is not really enough. The escaping of output that goes into an HTML attribute differs from output that goes into an element's content. If you have a look at the escaper class from Zend Framework 2, you realize that the whole business of escaping an HTML attribute value is a lot more complicated.
No, there is nothing you can do to make it easier. The only chance is to use a function that does everything you need to make things easier for you, but you still need to apply the correct escaping everywhere.
You can use a simple solution like this:
$s = "site.com/category/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/category/Construction/RealEstate
Perhaps, you want to use urldecode() and remove the whitespace afterwards?

php - Clean user input using preg_replace_callback and ord()?

I have a forum style text box and I would like to sanitize the user input to stop potential XSS and code insertion. I have seen htmlentities used, but then others have said that &, #, %, : characters need to be encoded as well, and it seems the more I look, the more potentially dangerous characters pop up. Whitelisting is problematic as there are many valid text options beyond ^a-zA-Z0-9. I have come up with this code. Will it work to stop attacks and be secure? Is there any reason not to use it, or a better way?
function replaceHTML ($match) {
    return "&#" . ord ($match[0]) . ";";
}
$clean = preg_replace_callback ( "/[^ a-zA-Z0-9]/", "replaceHTML", $userInput );
EDIT:_____________________________
I could of course be wrong, but it is my understanding that htmlentities only replaces & < > " (and ' if ENT_QUOTES is turned on). This is probably enough to stop most attacks (and frankly probably more than enough for my low traffic site). In my obsessive attention to detail, however, I dug further. A book I have warns to also encode # and % for "shutting down hex attacks". Two websites I found warned against allowing : and --. It's all rather confusing to me, and led me to explore converting all non-alphanumeric characters. If htmlentities does this already then great, but it does not seem to. Here are the results from code I ran, copied after clicking View Source in Firefox.
original (random characters to test):
<b>5:</b>gjla<hi>#''*&$!j-l:4
preg_replace_callback:
&#60;b&#62;5&#58;&#60;&#47;b&#62;gjla&#60;hi&#62;&#35;&#39;&#39;&#42;&#38;&#36;&#33;j&#45;l&#58;4
htmlentities (w/ ENT_QUOTES):
&lt;b&gt;5:&lt;/b&gt;gjla&lt;hi&gt;#&#039;&#039;*&amp;$!j-l:4
htmlentities appears to not be encoding those other characters like :
Sorry for the wall of text. Is this just me being paranoid?
EDIT #2: ___________
All you need to do to stop XSS attacks is use htmlspecialchars().
That is exactly what htmlentities does already:
http://codepad.viper-7.com/NDZMa3
It will convert (spaced to prevent stackoverflow double encoding):
"& # amp ;"
to
"& # amp; # amp ;"
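The space ' ' can be changed to \s in your regex, and adding the i modifier at the end of the regex makes it case insensitive (the u modifier makes it match whole UTF-8 characters). You also don't need to translate your characters to sequences manually; that can be done by calling htmlentities from the callback, which receives the match array:
$clean = preg_replace_callback('/[^a-z0-9\s]/iu', function ($m) { return htmlentities($m[0], ENT_QUOTES, 'UTF-8'); }, $userInput);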
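If encoding every non-alphanumeric character as a numeric entity is still wanted, one detail to watch is that ord() only looks at the first byte of a multibyte character. A possible variant, assuming UTF-8 input and the mbstring extension (the name encode_non_alnum() is hypothetical):

```php
<?php
// Hypothetical variant of replaceHTML(): the /u flag makes the regex match whole
// UTF-8 characters, and mb_ord() returns the code point rather than the first byte.
function encode_non_alnum(string $input): ?string
{
    // preg_replace_callback() returns null if the input is not valid UTF-8.
    return preg_replace_callback(
        '/[^ a-zA-Z0-9]/u',
        function (array $match): string {
            return '&#' . mb_ord($match[0], 'UTF-8') . ';';
        },
        $input
    );
}

echo encode_non_alnum("<b>\u{e9}</b>"), "\n";
// prints: &#60;b&#62;&#233;&#60;&#47;b&#62;
```

This is only an illustration of the encoding step; for plain output escaping, htmlspecialchars() remains the simpler choice.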

URL/HTML escaping/encoding

I have always been confused with URL/HTML encoding/escaping. I am using PHP, so I want to clear some things up.
Can I say that I should always use
urlencode: for individual query string parts
$url = 'http://test.com?param1=' . urlencode('some data') . '&param2=' . urlencode('something else');
htmlentities: for escaping special characters like <> so that it will be rendered properly by the browser
Would there be any other places I might use each function? I am not good at all these escaping stuff and am always confused by them.
First off, you shouldn't be using htmlentities() around 99% of the time. Instead, you should use htmlspecialchars() for escaping text for use inside XML and HTML documents.
htmlentities() is only useful for displaying characters that the native character set you're using can't display (it is useful if your pages are in ASCII, but you have some UTF-8 characters you would like to display). Instead, just make the whole page UTF-8 (it's not hard), and be done with it.
As far as urlencode(), you hit the nail on the head.
So, to recap:
Inside HTML:
<b><?php echo htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?></b>
Inside of a URL:
$url = '?foo=' . urlencode('bar');
That's about right. Although - htmlspecialchars is fine, as long as you get your charsets straight. Which you should do anyway. So I tend to use that, so I would find out early if I had messed it up.
Also note that if you put a URL into an HTML context (say - in the href of an a-tag), you need to escape that. So you'll often see something like:
echo "<a href='" . htmlspecialchars("?foo=" . urlencode($foo)) . "'>clicky</a>";
If you are building a query string for your URL, then it's best to just use http_build_query() instead of manually encoding each part.
$params = [
    'param1' => 'some data',
    'param2' => 'something else',
];
echo '<a href="http://test.com?' . htmlspecialchars(http_build_query($params)) . '">Link</a>';
All output in HTML should be HTML-encoded too, even though there is only a very small chance that a properly encoded URL will break the HTML.
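To see what each layer contributes, here is a small sketch that prints the query string on its own and then the HTML-escaped link. The test.com URL is the one from the question; everything else is standard PHP.

```php
<?php
$params = [
    'param1' => 'some data',
    'param2' => 'something else',
];

// http_build_query() URL-encodes each value and joins the pairs with "&".
$query = http_build_query($params);
echo $query, "\n";
// prints: param1=some+data&param2=something+else

// The finished URL is then escaped once more for the HTML attribute context.
$href = htmlspecialchars('http://test.com/?' . $query, ENT_QUOTES, 'UTF-8');
echo '<a href="' . $href . '">Link</a>', "\n";
// prints: <a href="http://test.com/?param1=some+data&amp;param2=something+else">Link</a>
```

The only visible change from the second step is the & separator becoming &amp;, which is exactly the character that could otherwise be misread inside the attribute.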

sql injection / Browser-Hijacker prevention php

I have a website where I can't use htmlentities() or htmlspecialchars() to process user input data. Instead, I added a custom function which uses an array $forbidden to clean the input string of all unwanted characters. At the moment I have '<', '>', "'" as unwanted characters because of SQL injection/browser hijacking. My site is encoded in UTF-8; do I have to add more characters to that array, i.e. the character '<' encoded in other charsets?
Thanks for any help,
Maenny
Neither htmlentities nor htmlspecialchars has anything to do with SQL injection.
To prevent injection, you have to follow some rules; I've described them all here.
To filter HTML you may use the htmlspecialchars() function; it will harm none of your Cyrillic characters.
You should escape ", too. It does much more harm than ', because you often enclose HTML attributes in ". But why don't you simply use htmlspecialchars to do that job?
Furthermore: it isn't good to use one escaping function for both SQL and HTML. HTML needs escaping of tags, whereas SQL does not. So it would be best if you used htmlspecialchars for HTML output and PDO::quote (or mysql_real_escape_string or whatever you are using) for SQL queries.
But I know (from my own experience) that escaping all user input in SQL queries may be really annoying and sometimes I simply don't escape parts, because I think they are "secure". But I am sure I'm not always right, about assuming that. So, in the end I wanted to ensure that I really escape all variables used in an SQL query and therefore have written a little class to do this easily: http://github.com/nikic/DB Maybe you want to use something similar, too.
Put this code into your header page. Note that it guards against header injection (it cuts the string at the first CR or LF); it does not prevent SQL injection.
function clean_header($string)
{
    $string = trim($string);
    // From RFC 822: "The field-body may be composed of any ASCII
    // characters, except CR or LF."
    if (strpos($string, "\n") !== false) {
        $string = substr($string, 0, strpos($string, "\n"));
    }
    if (strpos($string, "\r") !== false) {
        $string = substr($string, 0, strpos($string, "\r"));
    }
    return $string;
}
