URL/HTML escaping/encoding

URL/HTML escaping/encoding - php

I have always been confused with URL/HTML encoding/escaping. I am using PHP, so I want to clear some things up.
Can I say that I should always use
urlencode: for individual query string parts
$url = 'http://test.com?param1=' . urlencode('some data') . '&param2=' . urlencode('something else');
htmlentities: for escaping special characters like <> so that if will be rendered properly by the browser
Would there be any other places I might use each function? I am not good at all these escaping stuff and am always confused by them.

First off, you shouldn't be using htmlentities() around 99% of the time. Instead, you should use htmlspecialchars() for escaping text for use inside XML and HTML documents.
htmlentities are only useful for displaying characters that the native character set you're using can't display (it is useful if your pages are in ASCII, but you have some UTF-8 characters you would like to display). Instead, just make the whole page UTF-8 (it's not hard), and be done with it.
As far as urlencode(), you hit the nail on the head.
So, to recap:
Inside HTML:
<b><?php echo htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?></b>
Inside of a URL:
$url = '?foo=' . urlencode('bar');

That's about right. Although - htmlspecialchars is fine, as long as you get your charsets straight. Which you should do anyway. So I tend to use that, so I would find out early if I had messed it up.
Also note that if you put a URL into an HTML context (say - in the href of an a-tag), you need to escape that. So you'll often see something like:
echo "<a href='" . htmlspecialchars("?foo=" . urlencode($foo)) . "'>clicky</a>"

If you are building a query string for your URL, then it's best to just use http_build_query() instead of manually encoding each part.
$params = [
'param1' => 'some data',
'param2' => 'something else',
];
echo 'Link';
All output in HTML should be HTML encoded too, despite there being a very tiny chance your URL, which is properly encoded, will break the HTML.

Related

How to properly create HTML links in PHP?

This question is about the proper use of rawurlencode, http_build_query & htmlspecialchars.
Until now my standard way of creating HTML link in vanilla PHP was this:
$qs = [
'foo' => 'foo~bar',
'bar' => 'bar foo',
];
echo 'Link';
Recently I have learned that this is not 100% correct. Here are few issues:
http_build_query uses by default PHP_QUERY_RFC1738 instead of PHP_QUERY_RFC3986. RFC3986 is the standard and superseded RFC1738 which in PHP is only kept for legacy use.
While the "special" HTML characters in the key and value part will be encoded to the percent-encoded representation, the argument separator will be an ampersand. In most sane situations this would not be a problem, but sometimes your key name might be quot; and then your link will become invalid:
$qs = [
'a' => 'a',
'quot;' => 'bar',
];
echo 'Link';
The code above will generate this link: ?a=a"%3B=bar!
IMO this implies that the function http_build_query needs to be called context-aware with the 3-rd argument & when in HTML, and with just & when in header('Location: ...');. Another option would be to pass it through htmlspecialchars before displaying in HTML.
PHP manual for urlencode (which should be deprecated long time ago IMO) suggests to encode only the value part of query string and then pass the whole query string through htmlentities before displaying in HTML. This looks very incorrect to me; the key part could still contain forbidden URL characters.
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
My conclusion is to do something along this lines:
$qs = [
'a' => 'a',
'quot;' => 'bar foo',
];
echo 'Link';
What is the recommended way to create HTML links in PHP? Is there an easier way than what I came up with? Have I missed any crucial points?

How to dynamically build HTML links with query string?
If you need to create query string to be used in HTML link (e.g. Link) then you should use http_build_query.
This function accepts 4 parameters, with the first one being an array/object of query data. For the most part the other 3 parameters are irrelevant.
$qs = [
'a' => 'a',
'quot;' => 'bar foo',
];
echo 'Link';
However, you should still pass the output of the function through htmlspecialchars to encode the & correctly. "A good framework will do this automatically, like Laravel's {{ }}"
echo 'Link';
Alternatively you can pass the third argument to http_build_query as '&', leaving the second one null. This will use & instead of & which is what htmlspecialchars would do.
About spaces.
For use in form data (i.e. query strings) the space should be encoded as + and in any other place it should be encoded as %20 e.g. new%20page.php?my+field=my+val. This is to ensure backwards comparability with all browsers. You can use the newer RFC3986 which will encode the spaces as %20 and it will work in all common browsers as well as be up to date with modern standards.
echo 'Link';
rawurlencode vs urlencode
For any part of URL before ? you should use rawurlencode. For example:
$subdir = rawurlencode('blue+light blue');
echo 'rawurlencode';
If in the above example you used urlencode the link would be broken. urlencode has very limited use and should be avoided.
Do not pass whole URL through rawurlencode. Separators / and other special characters in URL should not be encoded if they are to fulfil their function.
Footnote
There is no general agreement on the best practices for using http_build_query, other than the fact it should be passed through htmlspecialchars just like any other output in HTML context.
Laravel uses http_build_query($array, null, '&', PHP_QUERY_RFC3986)
CodeIgniter uses http_build_query($query)
Symfony uses http_build_query($extra, '', '&', PHP_QUERY_RFC3986)
Slim uses http_build_query($queryParams)
CakePHP uses http_build_query($query)
Twig uses http_build_query($url, '', '&', PHP_QUERY_RFC3986)

How to retrieve original text after using htmlspecialchars() and htmlentities()

I have some text that I will be saving to my DB. Text may look something like this: Welcome & This is a test paragraph. When I save this text to my DB after processing it using htmlspecialchars() and htmlentities() in PHP, the sentence will look like this: Welcome & This is a test paragraph.
When I retrieve and display the same text, I want it to be in the original format. How can I do that?
This is the code that I use;
$text= htmlspecialchars(htmlentities($_POST['text']));
$text= mysqli_real_escape_string($conn,$text);

There are two problems.
First, you are double-encoding HTML characters by using both htmlentities and htmlspecialchars. Both of those functions do the same thing, but htmlspecialchars only does it with a subset of characters that have HTML character entity equivalents (the special ones.) So with your example, the ampersand would be encoded twice (since it is a special character), so what you would actually get would be:
$example = 'Welcome & This is a test paragraph';
$example = htmlentities($example);
var_dump($example); // 'Welcome & This is a test paragraph'
$example = htmlspecialchars($example);
var_dump($example); // 'Welcome &amp; This is a test paragraph'
Decide which one of those functions you need to use (probably htmlspecialchars will be sufficient) and use only one of them.
Second, you are using these functions at the wrong time. htmlentities and htmlspecialchars will not do anything to "sanitize" your data for input into your database. (Not saying that's what you're intending, as you haven't mentioned this, but many people do seem to try to do this.) If you want to protect yourself from SQL injection, bind your values to prepared statements. Escaping it as you are currently doing with mysqli_real_escape_string is good, but it isn't really sufficient.
htmlspecialchars and htmlentities have specific purposes: to convert characters in strings that you are going to output into an HTML document. Just wait to use them until you are ready to do that.

converting url sperators with slash

I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from database
for each of them. But I before that I have to send a user to url just for that category.
Here is the problem, if I did something like this.
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/cateogry/Construction%20/%20Real%20Estate
I am trying to remove the %20 and make them / So, I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/cateogry/Construction///Real/Estate
^ ^ and ^ those are the problems.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this by using at-least 10 lines of codes, but I was hoping if there is a regex, and simple php way to fix it.

You have a string for human consumption, and based on that string you want to create a URL.
To avoid any characters messing up your HTML, or get abuses as XSS attack, you need to escape the human readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>;
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
In fact, using htmlspecialchars() is not really enough. The escaping of output that goes into an HTML attribute differs from output as the elements content. If you have a look at the escaper class from Zend Framework 2, you realize that the whole thing of escaping a HTML attribute value is a lot more complicated
No, there is nothing you can do to make it easier. The only chance is to use a function that does everything you need to make things easier for you, but you still need to apply the correct escaping everywhere.

You can use a simple solution like this:
$s = "site.com/cateogry/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/cateogry/Construction/RealEstate

Perhaps, you want to use urldecode() and remove the whitespace afterwards?

Mitigate xss attacks when building links

I posted this question a while back and it is working great for finding and 'linkifying' links from user generated posts.
Linkify Regex Function PHP Daring Fireball Method
<?php
if (!function_exists("html")) {
function html($string){
return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
}
if ( false === function_exists('linkify') ):
function linkify($str) {
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = $matches[2] == 'http' ? $input : "http://$input";
return '' . "$input";
}, $str);
}
endif;
echo "<div>" . linkify(html($row_rsgetpost['userinput'])) . "</div>";
?>
I am concerned that I may be introducing a security risk by inserting user generated content into a link. I am already escaping user content coming from my database with htmlspecialchars($string, ENT_QUOTES, 'UTF-8') before running it through the linkify function and echoing back to the page, but I've read on OWASP that link attributes need to be treated specially to mitigate XSS. I am thinking this function is ok since it places the user-generated content inside double quotes and has already been escaped with htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), but would really appreciate someone with xss expertise to confirm this. Thanks!

First of data must NEVER be escaped before entering the database, this is very serious mistake. This is not only insecure, but it breaks functionality. Chaining the values of strings, is data corruption and affects string comparison. This approach is insecure because XSS is an output problem. When you are inserting data into the database you do not know where it appears on the page. For instance, even if you where this function the following code is still vulnerable to XSS:
For example:
<a href="javascript:alert(1)" \>
In terms of your regular expression. My initial reaction was, well this is a horrible idea. No comments on how its supposed to work and heavy use of NOT operators, blacklisting is always worse than white-listing.
So I loaded up Regex Buddy and in about 3 minutes I bypassed your regex with this input:
https://test.com/test'onclick='alert(1);//
No developer wants to write a vulnerably, so they are caused with a breakdown in how programmer thinks his application is working, and how it actually works. In this case i would assume you never tested this regex, and its a gross oversimplification of the problem.
HTMLPurifer is a php library designed to clean HTML, it consist of THOUSANDS of regular expressions. Its very slow, and is bypassed on a fairly regular basis. So if you go this route, make sure to update regularly.
In terms of fixing this flaw i think your best off using htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), and then enforcing that the string start with 'http'. HTML encoding is a form of escaping, and the value will be automatically decoded such that the URL is unmolested.

Because the data is going into an attribute, it should be url (or percent) encoded:
return '' . "$input";
Technically it should also then be html encoded
return '' . "$input";
but no browsers I know of care and consequently no-one does it, and it sounds like you might be doing this step already and you don't want to do this twice

Your regular expression is looking for urls that are of http or https. That expression seems to be relatively safe as in does not detect anything that is not a url.
The XSS vulnerability comes from the escaping of the url as html argument. That means making sure that the url cannot prematurely escape the url string and then add extra attributes to the html tag that #Rook has been mentioning.
So I cannot really think of a way how an XSS attack could be performed the following code as suggested by #tobyodavies, but without urlencode, which does something else:
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = $matches[2] == 'http' ? $input : "http://$input";
return '' . "$input";
}, $str);
Note that I have also a added a small shortcut for checking the http prefix.
Now the anchor links that you generate are safe.
However you should also sanitize the rest of the text. I suppose that you don't want to allow any html at all and display all the html as clear text.

Firstly, as the PHP documentation states htmlspecialchars only escapes
" '&' (ampersand) becomes '&'
'"' (double quote) becomes '"' when ENT_NOQUOTES is not set.
"'" (single quote) becomes ''' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '<'
'>' (greater than) becomes '>'
". javascript: is still used in regular programming, so why : isn't escaped is beyond me.
Secondly, if !html only expects the characters you think will be entered, not the representation of those characters that can be entered and are seen as valid. the utf-8 character set, and every other character set supports multiple representations for the same character. Also, your false statement allows 0-9 and a-z, so you still have to worry about base64 characters. I'd call your code a good attempt, but it needs a ton of refining. That or you could just use htmlpurifier, which people can still bypass. I do think that it is awesome that you set the character set in htmlspecialchars, since most programmers don't understand why they should do that.

Automatic addition of trailing slash to urlencoded urls

I am very confused about the following:
echo("<a href='http://".urlencode("www.test.com/test.php?x=1&y=2")."'>test</a><br>");
echo("<a href='http://"."www.test.com/test.php?x=1&y=2"."'>test</a>");
The first link gets a trailing slash added (that's causing me problems)
The second link does not.
Can anyone help me to understand why.
Clearly it appears to be something to do with urlencode, but I can't find out what.
Thanks
c

You should not be using urlencode() to echo URLs, unless they contain some non standard characters.
The example provided doesn't contain anything unusual.
Example
$query = 'hello how are you?';
echo 'http://example.com/?q=' . urlencode($query);
// Ouputs http://example.com/?q=hello+how+are+you%3F
See I used it because the $query variable may contain spaces, question marks, etc. I can not use the question mark because it denotes the start of a query string, e.g. index.php?page=1.
In fact, that example would be better off just being output rather than echo'd.
Also, when I tried your example code, I did not get a traling slash, in fact I got
<a href='http://www.test.com%2Ftest.php%3Fx%3D1%26y%3D2'>test</a>

string urlencode ( string $str )
This function is convenient when
encoding a string to be used in a
query part of a URL, as a convenient
way to pass variables to the next
page.
Your urlencode is not used properly in your case.
Plus, echo don't usually come with () it should be echo "<a href='http [...]</a>";

You should use urlencode() for parameters only! Example:
echo 'http://example.com/index.php?some_link='.urlencode('some value containing special chars like whitespace');
You can use this to pass URLs, etc. to your URL.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.