Space and non-Latin characters in URL - php

I want my URLs to be as human-readable and pretty as possible. I noticed that even the space character works in a URL. Copy http://en.wikipedia.org/wiki/Prince of Persia or http://en.wikipedia.org/wiki/سنڌي into your browser's address bar, and it works!
These work too:
<a href='http://en.wikipedia.org/wiki/Prince of Persia'> Foo </a>
<a href='http://en.wikipedia.org/wiki/سنڌي'> Bar </a>
How safe is it to use Unicode letters beyond A-Z in URLs? My URLs are simple, without any punctuation marks, similar to Wikipedia links.
It's not important to me whether it is valid or invalid, I only want it to work!
(Actually, I am going to use + instead of the space; my main concern is the Unicode text.)
Does the above work hassle-free in all common browsers?

This is completely browser dependent as it's just a UI gimmick. The actual URL used is the encoded one and if you try to copy-paste it, you'll see that your clipboard has the URL in URL encoding.

When you call "http://en.wikipedia.org/wiki/Prince of Persia", it's your browser doing magic behind the scenes. The space character is converted to the %20 escape and sent to the web server.
http://en.wikipedia.org/wiki/Prince%20of%20Persia
Wikipedia, however, does not return the real page just yet. Your browser receives a redirect reply with the underscore format, and only then downloads the real content. Sanitizing the visible URLs is part of the Wikipedia server's business logic.
HTTP/1.0 301 Moved Permanently
Location: http://en.wikipedia.org/wiki/Prince_of_Persia
You just made your browser and the internet work harder :-)
Any Unicode letter is fine; the letters are %XX-escaped before being sent to the web server. Here is the real URL as seen by the destination web server.
http://en.wikipedia.org/wiki/%D8%B3%D9%86%DA%8C%D9%8A
It's up to the web server to know how to handle UTF-8 escaping; most modern servers know what to do. Smart browsers can convert %XX escapes back to visible letters in the address bar. When you make HTTP calls programmatically, you need to know how the escaping works.
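For the programmatic case, the escaping the browser does for you has to be done by hand. A minimal PHP sketch, assuming the title is a UTF-8 string and that only the last path segment needs encoding (the helper name wikiUrl is made up for illustration):

```php
<?php
// Hypothetical helper: builds a Wikipedia-style URL from a UTF-8 title.
function wikiUrl(string $title): string
{
    // rawurlencode() percent-encodes every byte outside the unreserved
    // set (A-Z a-z 0-9 - _ . ~); a space becomes %20, not "+".
    return 'http://en.wikipedia.org/wiki/' . rawurlencode($title);
}

echo wikiUrl('Prince of Persia'), "\n";
// prints http://en.wikipedia.org/wiki/Prince%20of%20Persia

// "\u{...}" escapes (PHP 7+) spell out the Sindhi word from the question.
echo wikiUrl("\u{0633}\u{0646}\u{068C}\u{064A}"), "\n";
// prints http://en.wikipedia.org/wiki/%D8%B3%D9%86%DA%8C%D9%8A
```

The second result is the same byte sequence the destination server sees in the example above.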

Related

How can I manage a URL with special characters in cURL? [duplicate]

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is the best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or it is better to stick to the usage of non-english characters where appropriate sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC3986 says no, that is, URIs cannot contain unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appears to be Unicode characters in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the unencoded version won't work because the underlying definition of URLs simply doesn't support it. So for it to work consistently, you need to %-encode.
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
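The rewriting and character-translating step could look roughly like the sketch below. It is not the code from this answer: the function name, the accent table, and the assumption that requests arrive percent-encoded in UTF-8 are all made up for illustration.

```php
<?php
// Illustrative accent table; a real site would need a much fuller one.
const ACCENT_MAP = [
    "\u{E9}" => 'e', // e with acute
    "\u{E8}" => 'e', // e with grave
    "\u{E0}" => 'a', // a with grave
    "\u{E7}" => 'c', // c with cedilla
];

// Hypothetical helper: maps a requested path to its unaccented primary name.
function primaryName(string $requestedPath): string
{
    // Undo the %XX escaping first, then replace accented letters.
    return strtr(rawurldecode($requestedPath), ACCENT_MAP);
}

echo primaryName('/myresum%C3%A9.html'), "\n"; // prints /myresume.html
echo primaryName('/myresume.html'), "\n";      // prints /myresume.html
```

Both spellings end up at the same primary identifier, which stays within plain ASCII.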
URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice, so I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar). So it is probably not supported in IE 6, at least, and there are still far too many people using that one :-(
URLs with no accents may not look as good as they could, but people are used to them and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that users may enter into the browser manually. They are OK in embedded links that the server has pre-encoded.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding a given browser uses. See my question on this issue,
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Names) rules, and can contain most Unicode characters.
The path (after the first /), the user name and the password can again be anything. They are escaped (as %XX), but those are just bytes. The encoding of these bytes is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/arguments be UTF-8, but this is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with that might be tricky. Try to use Unicode and stay away from legacy code pages, tag your content with the proper encoding/charset if you can (using meta in html, language directives for asp/jsp, etc.)
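To make the per-area rules concrete, here is a small PHP sketch that encodes the path and the parameters separately, following the UTF-8 recommendation above. The helper name buildUrl and the example values are hypothetical, and the host is assumed to be plain ASCII already (IDN conversion is left out).

```php
<?php
// Hypothetical helper: assembles a URL from an ASCII origin, raw UTF-8
// path segments, and raw UTF-8 query parameters.
function buildUrl(string $origin, array $segments, array $query = []): string
{
    // Each path segment is encoded on its own, so a "/" inside a segment
    // becomes %2F instead of splitting the path.
    $url = $origin . '/' . implode('/', array_map('rawurlencode', $segments));

    if ($query !== []) {
        // PHP_QUERY_RFC3986 encodes spaces as %20 rather than "+".
        $url .= '?' . http_build_query($query, '', '&', PHP_QUERY_RFC3986);
    }
    return $url;
}

echo buildUrl('http://www.example.com', ['user', "Bj\u{F6}rk"], ['q' => "caf\u{E9} au lait"]), "\n";
// prints http://www.example.com/user/Bj%C3%B6rk?q=caf%C3%A9%20au%20lait
```

The bytes after each % are UTF-8 only because the PHP strings were UTF-8 to begin with; nothing in the URL itself records that choice.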

What unicode character groups should we limit the user to, to create Beautiful URLs?

I recently started looking at adding untrusted usernames in prettied urls, eg:
mysite.com/
mysite.com/user/sarah
mysite.com/user/sarah/article/my-home-in-brugge
mysite.com/user/sarah/settings
etc..
Note the username 'sarah' and the article name 'my-home-in-brugge'.
What I would like to achieve, is that someone could just copy-paste the following url somewhere:
(1)
mysite.com/user/Björk Guðmundsdóttir/articles
mysite.com/user/毛泽东/posts
...and it would be very clear, before clicking on the link, what to expect to see. The same two URLs, with the usernames encoded using PHP rawurlencode() (considered the proper way of doing this):
(2)
mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
mysite.com/user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
...are a lot less clear.
There are three ways to securely (to some level of guarantee) pass an untrusted name containing readable utf8 characters into a url path as a directory:
A. You reparse the string into allowable characters whilst still keeping it uniquely associated in your database to that user, eg:
(3)
mysite.com/user/bjork-guomundsdottir/articles
mysite.com/user/mao-ze-dong12/posts
B. You limit the user's input at string creation time to acceptable characters for url passing (you ask eg. for alphanumeric characters only):
(4)
mysite.com/user/bjorkguomundsdottir/articles
mysite.com/user/maozedong12/posts
using eg. a regex check (for simplicity sake)
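// Accepts letters, numbers, punctuation, space separators, and math and
// currency symbols from any script; the /u flag makes PCRE treat the
// input as UTF-8.
if (!preg_match('/^[\p{L}\p{N}\p{P}\p{Zs}\p{Sm}\p{Sc}]+$/u', trim($sUserInput))) {
    //...
}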
C. You escape them in full using PHP rawurlencode(), and get the ugly output as in (2).
Question:
I want to focus on B, and push this as far as possible within KNOWN errors/concerns, until we get the beautiful URLs as in (1). I found out that passing many Unicode characters in URLs is possible in modern browsers. Modern browsers automatically convert Unicode characters, or characters that cannot be parsed in a URL, into encoded characters, allowing the user to, e.g., copy-paste the nice-looking Unicode URLs as in (1), and the browser will get the actual final URL right.
For some characters, the browser will not get it right without encoding: e.g. ?, #, / or \ will definitely and clearly break the URL.
So: Which characters in the (non-alphanumeric) ASCII range, and across the entire Unicode spectrum, can we allow at creation time to be injected into a URL without escaping? Or better: Which groups of Unicode characters can we allow? Which characters are definitely always blacklisted? There will be special cases: spaces look fine, except at the end of the string, where they could be missed when selecting the URL. Is there a reference out there that shows which browsers interpret which Unicode character ranges correctly?
PS: I am very well aware that using improperly encoded strings in urls will almost never provide a security guarantee. This question is certainly not recommended practice, but I do not see the difference of asking this question, and the done-so-often matter of copy-pasting a url from a website and pasting it into the browser, without thinking it through whether that url was correctly encoded or not (the novice user wouldn't). Has someone looked at this before, and what was their code (regex, conditions, if-statement..) solution?
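As a starting point for approach B, a creation-time check could look like the sketch below. It is one possible whitelist, not an established answer to which ranges every browser handles: the function name is made up, and the choice of letters, combining marks and numbers separated by single spaces or hyphens is an assumption. It rejects ?, #, / and \ simply because they are not in the whitelist.

```php
<?php
// Hypothetical validator for approach B: letters, combining marks and
// digits from any script, with single spaces or hyphens allowed only
// between them (so never at the start or end of the name).
function isUrlFriendlyName(string $name): bool
{
    // \z anchors at the very end, so a trailing newline does not slip
    // through; /u makes PCRE reject input that is not valid UTF-8.
    $pattern = '/^[\p{L}\p{M}\p{N}]+(?:[ -][\p{L}\p{M}\p{N}]+)*\z/u';
    return preg_match($pattern, $name) === 1;
}

var_dump(isUrlFriendlyName("Bj\u{F6}rk Gu\u{F0}mundsd\u{F3}ttir")); // bool(true)
var_dump(isUrlFriendlyName("\u{6BDB}\u{6CFD}\u{4E1C}"));            // bool(true)
var_dump(isUrlFriendlyName('sarah/settings'));                      // bool(false)
var_dump(isUrlFriendlyName('sarah '));                              // bool(false)
```

Even for names that pass, links written into a page can still go through rawurlencode(); the check only narrows what a pasted, unencoded URL might contain.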

SEO Canonical URL in Greek characters

I have a URL which includes Greek letters
http://www.mydomanain.com/gr/τιτλος-σελιδας/20/
I am using $_SERVER['REQUEST_URI'] to insert the value into the canonical link in my page head, like this
<link rel="canonical" href="http://www.mydomanain.com<?php echo $_SERVER['REQUEST_URI']; ?>" />
The problem is that when I view the page source, the URL is displayed with characters like ...CE%B3%CE%B3%CE%B5%CE%BB..., but when I click on it, the link is displayed as it should be.
Will this cause any penalty from search engines?
No, this is the correct behaviour. All characters in URLs can be present in the page source in their human-readable form or in encoded form, which can be translated back using tables for the relevant character set. When the link is clicked, the encoded value is sent to the server, which translates it back to its human-readable form.
It is common to encode characters that may cause issues in URLs, spaces being a common example (%20); see ASCII tables. The %xx syntax gives the hexadecimal value of each byte of the character.
Search engines will be aware of this and interpret the characters correctly.
When sending the HTML to the browser, ensure that the character set specified by the server matches your HTML. Search engines will also look for this to correctly decode the HTML. The correct way to do this is via HTTP response headers. In PHP these are set with header:
header('Content-Type: text/html; charset=utf-8');
// Change utf-8 to a different encoding if used
URLs can only consist of a limited subset of ASCII characters. You cannot in fact use "greek characters" in a URL. All characters outside this limited ASCII range must be percent-encoded.
Now, browsers do two things:
If they encounter URLs in your HTML which fall outside this rule, i.e. which contain unencoded non-ASCII characters, the browser will helpfully encode them for you before sending off the request to your server.
For some (unambiguous) characters, the browser will display them in their decoded form in the address bar, to enhance the UX.
So, yeah, all is good. In fact, you should be percent-encoding your URLs yourself if they aren't already.
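If you do want to percent-encode the canonical URL yourself, one way is sketched below. The helper name canonicalPath is made up, and the sketch assumes the query string is not part of the canonical URL and that the path bytes are UTF-8.

```php
<?php
// Hypothetical helper: returns the path of a request URI with every
// segment percent-encoded exactly once, whether or not it arrived encoded.
function canonicalPath(string $requestUri): string
{
    // Drop the query string; only the path is kept (an assumption).
    $path = explode('?', $requestUri, 2)[0];

    $segments = array_map(
        static function (string $segment): string {
            // Decode first so already-encoded input is not encoded twice.
            return rawurlencode(rawurldecode($segment));
        },
        explode('/', $path)
    );
    return implode('/', $segments);
}

$href = 'http://www.mydomanain.com' . canonicalPath($_SERVER['REQUEST_URI'] ?? '/');
// htmlspecialchars() keeps quotes or angle brackets in the request from
// breaking out of the attribute.
echo '<link rel="canonical" href="', htmlspecialchars($href, ENT_QUOTES, 'UTF-8'), '" />', "\n";
```

The encoded and the human-readable form of the same path produce the same canonical value, so the tag no longer depends on how the request happened to arrive.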

Accents, urls and Firefox

I'm having some problems and I was wondering if any of you could help me.
I have my site & DB set to utf8. I have a problem when I type accents in the query string section: ã turns into %E3, but if I use links or forms within the page it gives %C3%A3 in the URL.
What can I do?
EDIT: Let me try to clarify this a bit:
I'm trying to use accented characters in my URLs (query strings) but I'm having a hard time getting this to work across multiple browsers. Some browsers like Firefox and IE output a different percent-encoded string depending on whether I'm using a form within the page or typing the accented character in the address bar. As I said in my original question, ã entered in a form turns into %C3%A3 in the URL, but if I type ã in the address bar, the browser changes that to %E3 in the URL.
This complicates things for me because if I get %E3, then in PHP/HTML I get an unknown character (that is the diamond question mark, correct?)
Hopefully this helps - let me know otherwise.
ã inputed in a form turns to %C3%A3 in the url
Depends on the form encoding, which is usually taken from the encoding of the page that contains the form. %C3%A3 is the correct UTF-8 URL-encoded form of ã.
if I type ã in the address bar, the browser changes that to %E3 in the url.
This is browser-dependent. When you put non-ASCII characters in a URL in the location bar:
http://www.example.com/test.p/café?café
WebKit browsers encode them all as UTF-8:
http://www.example.com/test.p/caf%C3%A9?caf%C3%A9
which is IMO most correct, as per IRI. However, IE and Opera, for historical reasons, use the OS's default system encoding to encode text entered into the query string only. So on a Western European Windows installation (using code page 1252), you get:
http://www.example.com/test.p/caf%C3%A9?caf%E9
For characters that aren't available in the system encoding, IE and Opera replace them with a ?. Firefox will use the system encoding when all the characters in the query string can be represented in it, or UTF-8 otherwise.
Horrible and inconsistent, but then it's pretty rare for users to manually type out query strings.
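On the PHP side, one way to cope with both forms is to check whether the decoded value is valid UTF-8 and convert it if it is not. This is only a heuristic sketch: the helper name is made up, it assumes the mbstring extension is loaded, and it assumes that anything that is not valid UTF-8 came from a Windows-1252 client as in the example above.

```php
<?php
// Hypothetical helper: returns the value as UTF-8, guessing Windows-1252
// for byte strings that are not valid UTF-8 (an assumption).
function toUtf8(string $value): string
{
    // An empty pattern with the /u flag only succeeds on valid UTF-8.
    if (preg_match('//u', $value) === 1) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// PHP has already undone the %XX escaping in $_GET; rawurldecode()
// stands in for that step here.
var_dump(toUtf8(rawurldecode('%C3%A3')) === "\u{E3}"); // bool(true)
var_dump(toUtf8(rawurldecode('%E3')) === "\u{E3}");    // bool(true)
```

A few Windows-1252 byte sequences also happen to be valid UTF-8, so the guess can be wrong for unusual input; links and forms generated by the page avoid the problem altogether.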

UTF-8 characters that aren't XSS vulnerabilities

I'm looking at encoding strings to prevent XSS attacks. Right now we want to use a whitelist approach, where any characters outside of that whitelist will get encoded.
Right now, we're taking things like '(' and outputting '&#40;' instead. As far as we can tell, this will prevent most XSS.
The problem is that we've got a lot of international users, and when the whole site's in japanese, encoding becomes a major bandwidth hog. Is it safe to say that any character outside of the basic ASCII set isn't a vulnerability and they don't need to be encoded, or are there characters outside the ASCII set that still need to be encoded?
Might be (a lot) easier if you just pass the encoding to htmlentities()/htmlspecialchars
echo htmlspecialchars($string, ENT_QUOTES, 'utf-8');
But if this is sufficient or not depends on what you're printing (and where).
see also:
http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
http://jimbojw.com/wiki/index.php?title=Sanitizing_user_input_against_XSS
http://www.erich-kachel.de/?p=415 (in german. If I find something similar in English -> update) edit: well, I guess you can get the main point without being fluent in german ;)
The string
javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))
passes htmlentities() unchanged. Now consider something like
<a href="<?php echo htmlentities($_GET['homepage']); ?>"
which will send
<a href="javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))">
to the browser. And that boils down to
href="javascript:eval(\"alert('XSS')\")"
While htmlentities() gets the job done for the contents of an element, it's not so good for attributes.
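For the href case, escaping has to be combined with a check on the URL scheme. A minimal sketch, assuming only http and https links should be allowed (the helper name and the "#" fallback are made up for illustration):

```php
<?php
// Hypothetical helper: returns a value that is safe to place inside
// href="..."; anything that is not an http(s) URL is replaced by "#".
function safeHref(string $url): string
{
    // The scheme must be the very first thing in the string, so
    // "javascript:..." and " javascript:..." are both rejected.
    if (preg_match('#^https?://#i', $url) !== 1) {
        return '#';
    }
    // Escaping is still needed so quotes cannot close the attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

echo safeHref('javascript:eval(String.fromCharCode(97,108,101,114,116))'), "\n";
// prints #
echo safeHref('http://example.com/?a=1&b=2'), "\n";
// prints http://example.com/?a=1&amp;b=2
```

Whitelisting the scheme is the part htmlentities() cannot do, because the javascript: string contains nothing that needs entity encoding.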
In general, yes, you can depend on anything non-ascii to be "safe", however there are some very important caveats to consider:
Always ensure that what you're sending to the client is tagged as UTF-8. This means having a header that explicitly says "Content-Type: text/html; charset=utf-8" on every single page, including all of your error pages if any of the content on those error pages is generated from user input. (Many people forget to test their 404 page, and have that page include the not-found URL verbatim)
Always ensure that what you're sending to the client is valid UTF-8. This means you cannot simply pass through bytes received from the user back to the user again. You need to decode the bytes as UTF-8, apply your html-encoding XSS prevention, and then encode them as UTF-8 as you write them back out.
The first of those two caveats is to keep the client's browser from seeing a bunch of stuff including high-bit characters and falling back to some local multi-byte character set. That local multi-byte character set may have multiple ways of specifying harmful ASCII characters that you won't have defended against. Related to this, some older versions of certain browsers - cough IE cough - were a bit overeager in detecting that a page was UTF-7; this opens up no end of XSS possibilities. To defend against this, you might want to make sure you html-encode any outgoing "+" sign; this is excessive paranoia when you're generating proper Content-Type headers, but will save you when some future person flips a switch that turns off your custom headers. (For example, by putting a poorly configured caching reverse proxy in front of your app, or by doing something to insert an extra banner header - PHP won't let you set any HTTP headers if any output is already written)
The second of those is because it is possible in UTF-8 to specify "overly short" sequences that, while invalid under current specs, will be interpreted by older browsers as ASCII characters. (See what Wikipedia has to say) Also, it is possible that someone may insert a single bad byte into a request; if you pass this back to the user, it can cause some browsers to replace both the bad byte and one or more bytes after it with "?" or some other "couldn't understand this" character. That is, a single bad byte could cause some good bytes to also be swallowed up. If you look closely at what you're outputting, there's probably a spot somewhere where an attacker who was able to wipe a byte or two out of the output could do some XSS. Decoding the input as UTF-8 and then re-encoding it prevents this attack vector.
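In PHP, the decode, escape and re-encode steps can be approximated with htmlspecialchars() and the ENT_SUBSTITUTE flag, which replaces invalid UTF-8 sequences with U+FFFD instead of passing them through. The sketch below is one way to do it under that assumption; the helper name is made up, and encoding "+" is the optional UTF-7 precaution mentioned above.

```php
<?php
// Hypothetical helper: escapes untrusted bytes for output in element
// content on a page that is served as UTF-8.
function safeHtml(string $bytes): string
{
    // ENT_SUBSTITUTE swaps invalid UTF-8 sequences for U+FFFD, so the
    // result is always valid UTF-8. Non-ASCII characters are left alone.
    $escaped = htmlspecialchars($bytes, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Optional precaution against UTF-7 sniffing in old browsers.
    return str_replace('+', '&#43;', $escaped);
}

echo safeHtml('<b>1+1</b>'), "\n";
// prints &lt;b&gt;1&#43;1&lt;/b&gt;
```

This still needs the header('Content-Type: text/html; charset=utf-8') call on every page, error pages included, before any output is written.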
