This question is about the proper use of rawurlencode, http_build_query & htmlspecialchars.
Until now my standard way of creating an HTML link in vanilla PHP was this:
$qs = [
'foo' => 'foo~bar',
'bar' => 'bar foo',
];
echo '<a href="?' . http_build_query($qs) . '">Link</a>';
Recently I have learned that this is not 100% correct. Here are a few issues:
http_build_query uses by default PHP_QUERY_RFC1738 instead of PHP_QUERY_RFC3986. RFC3986 is the standard and superseded RFC1738 which in PHP is only kept for legacy use.
While the "special" HTML characters in the key and value part will be encoded to the percent-encoded representation, the argument separator will be an ampersand. In most sane situations this would not be a problem, but sometimes your key name might be quot; and then your link will become invalid:
$qs = [
'a' => 'a',
'quot;' => 'bar',
];
echo '<a href="?' . http_build_query($qs) . '">Link</a>';
The code above will generate this link: ?a=a"%3B=bar!
IMO this implies that the function http_build_query needs to be called context-aware: with the third argument '&amp;' when in HTML, and with just '&' when in header('Location: ...');. Another option would be to pass its output through htmlspecialchars before displaying it in HTML.
The PHP manual for urlencode (which should have been deprecated a long time ago IMO) suggests encoding only the value part of the query string and then passing the whole query string through htmlentities before displaying it in HTML. This looks very incorrect to me; the key part could still contain forbidden URL characters.
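A minimal sketch of that context-aware idea: build the query string once with a plain & separator, then HTML-escape it only at the point where it is written into markup. The helper name build_query is hypothetical, not a PHP built-in.

```php
<?php
// Hypothetical helper: RFC 3986 encoding with a plain '&' separator.
function build_query(array $params): string
{
    return http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

$qs = ['a' => 'a', 'quot;' => 'bar'];

$raw     = build_query($qs);                             // a=a&quot%3B=bar
$forHtml = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');  // a=a&amp;quot%3B=bar

// HTML context: the separator is escaped, so "&quot" is no longer read as an entity.
echo '<a href="?' . $forHtml . '">Link</a>', "\n";

// HTTP header context: the unescaped string is the correct one.
// header('Location: /index.php?' . $raw);
```

The same $raw value serves both contexts; only the output step differs.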
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
My conclusion is to do something along these lines:
$qs = [
'a' => 'a',
'quot;' => 'bar foo',
];
echo '<a href="?' . http_build_query($qs, '', '&amp;', PHP_QUERY_RFC3986) . '">Link</a>';
What is the recommended way to create HTML links in PHP? Is there an easier way than what I came up with? Have I missed any crucial points?
How to dynamically build HTML links with query string?
If you need to create a query string to be used in an HTML link (e.g. in the href attribute of an <a> element) then you should use http_build_query.
This function accepts 4 parameters, with the first one being an array/object of query data. For the most part the other 3 parameters are irrelevant.
$qs = [
'a' => 'a',
'quot;' => 'bar foo',
];
echo '<a href="?' . http_build_query($qs) . '">Link</a>';
However, you should still pass the output of the function through htmlspecialchars to encode the & correctly. "A good framework will do this automatically, like Laravel's {{ }}"
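echo '<a href="?' . htmlspecialchars(http_build_query($qs)) . '">Link</a>';
Alternatively you can pass '&amp;' as the third argument to http_build_query, leaving the second one as an empty string (''). This will use &amp; instead of & as the separator, which is what htmlspecialchars would produce.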
About spaces.
For use in form data (i.e. query strings) the space has traditionally been encoded as + and in any other place it should be encoded as %20, e.g. new%20page.php?my+field=my+val. This is to ensure backwards compatibility with all browsers. You can use the newer RFC3986 mode, which will encode spaces as %20; it works in all common browsers and is up to date with modern standards.
echo '<a href="?' . http_build_query($qs, '', '&amp;', PHP_QUERY_RFC3986) . '">Link</a>';
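To make the difference between the two encoding modes concrete, here is a small sketch comparing them on the same data. The key q and its value are made up for illustration.

```php
<?php
$qs = ['q' => 'my val'];

// Default mode (PHP_QUERY_RFC1738): space becomes '+'.
$rfc1738 = http_build_query($qs);                              // q=my+val

// RFC 3986 mode: space becomes '%20'.
$rfc3986 = http_build_query($qs, '', '&', PHP_QUERY_RFC3986);  // q=my%20val

// PHP decodes both forms to the same data.
parse_str($rfc1738, $a);
parse_str($rfc3986, $b);
var_dump($a === $b); // bool(true)
```

Either form is read back identically by PHP, so the choice mostly matters for consistency with the rest of the URL.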
rawurlencode vs urlencode
For any part of URL before ? you should use rawurlencode. For example:
$subdir = rawurlencode('blue+light blue');
echo '<a href="' . $subdir . '/index.php">rawurlencode</a>';
If in the above example you used urlencode the link would be broken. urlencode has very limited use and should be avoided.
Do not pass whole URL through rawurlencode. Separators / and other special characters in URL should not be encoded if they are to fulfil their function.
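One way to follow both rules at once is to encode each path segment on its own and join the results with /. This is only a sketch; encode_path is a hypothetical helper name and the segments are example values.

```php
<?php
// Hypothetical helper: encode every path segment, keep the '/' separators intact.
function encode_path(array $segments): string
{
    return implode('/', array_map('rawurlencode', $segments));
}

$path = encode_path(['blue+light blue', 'index.php']);
echo $path, "\n";                        // blue%2Blight%20blue/index.php

// For comparison: urlencode turns the space into '+', which is a literal
// plus sign in the path part of a URL.
echo urlencode('blue+light blue'), "\n"; // blue%2Blight+blue
```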
Footnote
There is no general agreement on the best practices for using http_build_query, other than the fact it should be passed through htmlspecialchars just like any other output in HTML context.
Laravel uses http_build_query($array, null, '&', PHP_QUERY_RFC3986)
CodeIgniter uses http_build_query($query)
Symfony uses http_build_query($extra, '', '&', PHP_QUERY_RFC3986)
Slim uses http_build_query($queryParams)
CakePHP uses http_build_query($query)
Twig uses http_build_query($url, '', '&', PHP_QUERY_RFC3986)
Related
Why do we not encode = and & in query strings? I am referencing RFC 3986 but cannot find where it says that we should not encode these characters. Using Guzzle, it doesn't seem they encode anything really.
Take for example the query string: key1='1'&key2='2', shouldn't this be encoded as key1%3D%271%27%26key2%3D%272%27? If I plug key1='1'&key2='2' into chrome as a query string (e.g. www.google.com?key1='1'&key2='2'), it appears as key1=%271%27&key2=%272%27, which does not match guzzle. Guzzle outputs key1='1'&key2='2'. Guzzle's encoding algorithm is below:
private static $charUnreserved = 'a-zA-Z0-9_\-\.~';
private static $charSubDelims = '!\$&\'\(\)\*\+,;=';
public function encode($str)
{
return preg_replace_callback(
'/(?:[^' . self::$charUnreserved . self::$charSubDelims . '%:#\/\?]++|%(?![A-Fa-f0-9]{2}))/',
function ($match) {
return urlencode($match[0]);
},
$str
);
}
= and & don't have any special meaning as part of URL syntax. As far as URL syntax is concerned, they're just ordinary characters.
However, when used in query strings, there's a convention implemented by most application frameworks to use them to delimit parameters and values. If you want to use these characters literally in a parameter name or value, you need to encode them. See escaping ampersand in url
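A short sketch of what that convention means in practice, using a made-up value that contains both = and &. When the value is encoded it survives the round trip; when it is not, it gets split into extra parameters.

```php
<?php
$value = 'a=b&c';

// Encoded: '=' and '&' inside the value become %3D and %26.
$query = http_build_query(['key1' => $value], '', '&', PHP_QUERY_RFC3986);
echo $query, "\n"; // key1=a%3Db%26c

parse_str($query, $decoded);
var_dump($decoded['key1'] === $value); // bool(true)

// Not encoded: the '&' is treated as a delimiter and the value is cut short.
parse_str('key1=' . $value, $broken);
var_dump($broken); // key1 => "a=b", c => ""
```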
I'm setting up pagination on a search page and trying to append the search query to each page number link.
href="?s=search+term"
The problem is when a user enters special characters such as #, it will comment out anything behind it.
Normally I use htmlentities to turn it into %23 however it is not working in this situation.
Keep in mind that the first time it searches, it looks like this in the search query
href="?s=%23+search+term"
and upon clicking a page number the search query then looks like this
href="?s=#%20search%20term"
Which, when executed by php, is commented out. Any ideas on how to bypass this?
You'll need to use urlencode() on the search term to properly encode it for use in a url.
http://php.net/manual/en/function.urlencode.php
As a better option, you can generate the entire querystring from an array using http_build_query():
$params = [
's' => "my search term",
'p' => "3"
];
echo http_build_query($params); // will echo a properly encoded querystring
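Applied to the pagination case from the question, it could look roughly like this. The search term and page number are hard-coded here as assumptions; in a real page they would come from the request.

```php
<?php
// Assumed inputs; in practice these would come from $_GET.
$term = '# search term';
$page = 3;

$query = http_build_query(['s' => $term, 'p' => $page]); // s=%23+search+term&p=3

// Escape for the HTML attribute before printing the link.
$link = '<a href="?' . htmlspecialchars($query, ENT_QUOTES, 'UTF-8') . '">' . $page . '</a>';
echo $link, "\n"; // <a href="?s=%23+search+term&amp;p=3">3</a>
```

Because # is sent as %23, the browser no longer treats the rest of the URL as a fragment.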
I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from database
for each of them. But before that I have to send a user to a URL just for that category.
Here is the problem, if I did something like this.
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/category/Construction%20/%20Real%20Estate
I am trying to remove the %20 and make them / So, I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/category/Construction///Real/Estate
^ ^ and ^ those are the problems.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this with at least 10 lines of code, but I was hoping there is a regex or a simple PHP way to fix it.
You have a string for human consumption, and based on that string you want to create a URL.
To prevent any characters from breaking your HTML, or being abused for an XSS attack, you need to escape the human-readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>";
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
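The strtr() call above leaves runs of dashes behind (construction---real-estate). A slightly more thorough sketch collapses every run of non-alphanumeric characters into a single dash. The function name slugify is hypothetical, and the result is lossy, so the application would still need to map the slug back to the category.

```php
<?php
// Hypothetical helper: lowercase, then replace each run of characters that
// are not letters or digits with a single dash, and trim dashes at the ends.
function slugify(string $name): string
{
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($name, 'UTF-8')
        : strtolower($name);
    $slug = preg_replace('/[^\p{L}\p{N}]+/u', '-', $lower) ?? '';
    return trim($slug, '-');
}

$name = 'Construction / Real Estate';
$url  = 'site.com/category/' . rawurlencode(slugify($name));
echo $url, "\n"; // site.com/category/construction-real-estate
```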
In fact, using htmlspecialchars() is not really enough. Escaping output that goes into an HTML attribute differs from escaping output that becomes element content. If you have a look at the escaper class from Zend Framework 2, you will see that escaping an HTML attribute value is a lot more complicated.
No, there is nothing you can do to make it easier. The only chance is to use a function that does everything you need to make things easier for you, but you still need to apply the correct escaping everywhere.
You can use a simple solution like this:
$s = "site.com/category/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/category/Construction/RealEstate
Perhaps, you want to use urldecode() and remove the whitespace afterwards?
In the following redirect URL, http:// comes out as http%3A%2F%2F. How can I avoid this?
Thanks in advance.
$params = array(
'client_id' => $client_id,
'redirect_uri' => site_url('welcome/google_connect_redirect/'),
'state' => $_SESSION['state'],
'approval_prompt' => 'force',
'scope' => 'https://www.googleapis.com/auth/userinfo.profile https://www.googleapis.com/auth/userinfo.email',
'response_type' => 'code'
);
$url = "https://accounts.google.com/o/oauth2/auth?".http_build_query($params);
// send to google
redirect($url);
URL becomes like this.
https://accounts.google.com/o/oauth2/auth?client_id=871111192098.apps.
googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8888%2Fmyappname
%2Findex.php%2Fwelcome%2Fgoogle_connect_redirect&state=f0babsomeletterscb5b48753358c
3b9&approval_prompt=force&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2F
userinfo.profile+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email&
response_type=code
When you put strings with special characters into a URL, they are encoded; you can reverse this with urldecode.
The point of http_build_query() is that it urlencode()s each of the array's values for you before joining them in a querystring format. This is the preferred behavior.
The query string is encoded because there are some special characters that have special meaning in a URL.
From Wikipedia:
Some characters cannot be part of a URL (for example, the space) and
some other characters have a special meaning in a URL: for example,
the character # can be used to further specify a subsection (or
fragment) of a document; the character = is used to separate a name
from a value. A query string may need to be converted to satisfy these
constraints. This can be done using a schema known as URL encoding.
It's actually desired behavior and proper way to do it.
Look at manual and description of function http_build_query
Generates a URL-encoded query string from the associative (or indexed) array provided.
So basically it goes through the whole array, urlencodes each element (that's why you see these characters) and joins them with &. If you want to avoid this, don't use http_build_query(), but I really don't recommend that.
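To see that nothing is lost by the encoding, here is a small sketch that builds such a URL and then decodes the query string the way a receiving server would. The redirect_uri value is a shortened stand-in for the one in the question.

```php
<?php
$params = [
    // Illustrative value, shortened from the question's localhost URL.
    'redirect_uri'  => 'http://localhost:8888/welcome/google_connect_redirect',
    'response_type' => 'code',
];

$url = 'https://accounts.google.com/o/oauth2/auth?' . http_build_query($params);
echo $url, "\n"; // redirect_uri appears as http%3A%2F%2Flocalhost%3A8888%2F...

// The receiving side decodes the query string and gets the original value back.
parse_str(parse_url($url, PHP_URL_QUERY), $received);
var_dump($received['redirect_uri'] === $params['redirect_uri']); // bool(true)
```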
I have always been confused with URL/HTML encoding/escaping. I am using PHP, so I want to clear some things up.
Can I say that I should always use
urlencode: for individual query string parts
$url = 'http://test.com?param1=' . urlencode('some data') . '&param2=' . urlencode('something else');
htmlentities: for escaping special characters like <> so that it will be rendered properly by the browser
Would there be any other places I might use each function? I am not good at all these escaping stuff and am always confused by them.
First off, you shouldn't be using htmlentities() around 99% of the time. Instead, you should use htmlspecialchars() for escaping text for use inside XML and HTML documents.
htmlentities are only useful for displaying characters that the native character set you're using can't display (it is useful if your pages are in ASCII, but you have some UTF-8 characters you would like to display). Instead, just make the whole page UTF-8 (it's not hard), and be done with it.
As far as urlencode(), you hit the nail on the head.
So, to recap:
Inside HTML:
<b><?php echo htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?></b>
Inside of a URL:
$url = '?foo=' . urlencode('bar');
That's about right. Although - htmlspecialchars is fine, as long as you get your charsets straight. Which you should do anyway. So I tend to use that, so I would find out early if I had messed it up.
Also note that if you put a URL into an HTML context (say - in the href of an a-tag), you need to escape that. So you'll often see something like:
echo "<a href='" . htmlspecialchars("?foo=" . urlencode($foo)) . "'>clicky</a>";
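One caveat when the attribute is delimited with single quotes, as above: before PHP 8.1 the default flags of htmlspecialchars() did not escape single quotes, so passing ENT_QUOTES explicitly is the safer habit. A minimal sketch, where h is a hypothetical shorthand helper and $foo is an example value:

```php
<?php
// Hypothetical shorthand: escape for HTML text and for attributes with
// either quote style, independent of the PHP version's default flags.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$foo  = "it's <b>";
$href = '?foo=' . urlencode($foo); // ?foo=it%27s+%3Cb%3E

$html = "<a href='" . h($href) . "'>" . h($foo) . "</a>";
echo $html, "\n"; // <a href='?foo=it%27s+%3Cb%3E'>it&#039;s &lt;b&gt;</a>
```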
If you are building a query string for your URL, then it's best to just use http_build_query() instead of manually encoding each part.
$params = [
'param1' => 'some data',
'param2' => 'something else',
];
echo '<a href="?' . htmlspecialchars(http_build_query($params)) . '">Link</a>';
All output in HTML should be HTML encoded too, despite there being a very tiny chance your URL, which is properly encoded, will break the HTML.