When you create web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick with non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
One possible approach, of course, would be to set up alternate -- unaccented -- URLs as well, pointing to the original destination, but I would like to hear your opinions on using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser: in a browser that doesn't support IDN or Unicode, the un-encoded form won't work, because the underlying definition of URLs simply doesn't support it; so for it to work consistently, you need to percent-encode.
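For illustration, a small PHP sketch (assuming the intl extension is available for idn_to_ascii()) of what the browser actually puts on the wire behind that 'visual sugar':

// What the address bar may display vs. what is actually sent (sketch).
echo idn_to_ascii('café.com', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46), "\n"; // xn--caf-dma.com
echo rawurlencode('café'), "\n";                                            // caf%C3%A9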
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
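I don't have the exact rewriting code at hand, but a minimal PHP sketch of the translating idea could look like the following, assuming all requests are routed to one script (e.g. via a mod_rewrite catch-all); the accent map and file layout are purely illustrative:

// Fold accented characters in the requested path to ASCII, then serve the matching file.
$path = rawurldecode(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
$folded = strtr($path, [
    'é' => 'e', 'è' => 'e', 'ê' => 'e', 'á' => 'a', 'à' => 'a',
    'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
]);
$file = __DIR__ . $folded;            // e.g. "/myresumé.html" becomes "/myresume.html"
if (is_file($file)) {
    readfile($file);
} else {
    http_response_code(404);
}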
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, not with %-escapes, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar). So it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs without accents don't look as nice as they could; but still, people are used to them and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that users may type into the browser manually. They are fine for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding is used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
A full URL has several parts, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain (most) of the Unicode characters.
The path (after the first /), the user name, and the password can again be anything. They are escaped (as %XX), but those are just bytes; which encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The query part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/query parts be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using a meta tag in HTML, page directives for ASP/JSP, etc.).
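As a short PHP illustration (my own sketch, assuming the intl extension for the domain part), each area is typically prepared separately:

// Each part of the URL gets its own treatment.
$host  = idn_to_ascii('café.example', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46); // xn--caf-dma.example
$path  = implode('/', array_map('rawurlencode', ['wiki', 'Éléphant']));       // wiki/%C3%89l%C3%A9phant
$query = http_build_query(['q' => '毛泽东']);                                  // q=%E6%AF%9B%E6%B3%BD%E4%B8%9C
echo "https://$host/$path?$query";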
I recently started looking at putting untrusted usernames into prettified URLs, e.g.:
mysite.com/
mysite.com/user/sarah
mysite.com/user/sarah/article/my-home-in-brugge
mysite.com/user/sarah/settings
etc.
Note the username 'sarah' and the article name 'my-home-in-brugge'.
What I would like to achieve is that someone could just copy and paste a URL like the following somewhere:
(1)
mysite.com/user/Björk Guðmundsdóttir/articles
mysite.com/user/毛泽东/posts
...and it would be very clear, before clicking the link, what to expect to see. The same two URLs, with the usernames encoded using PHP's rawurlencode() (considered the proper way of doing this):
(2)
mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
mysite.com/user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
...are a lot less clear.
There are three ways to pass an untrusted name containing readable UTF-8 characters into a URL path as a segment, securely (to some level of guarantee):
A. You rewrite the string using only allowable characters, while still keeping it uniquely associated in your database with that user, e.g.:
(3)
mysite.com/user/bjork-guomundsdottir/articles
mysite.com/user/mao-ze-dong12/posts
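A minimal PHP sketch of approach A, assuming iconv with //TRANSLIT is available (the result depends on the system locale, and you would still need a uniqueness check, e.g. appending a number, against your database):

setlocale(LC_CTYPE, 'en_US.UTF-8');   // //TRANSLIT behaviour depends on the locale
function slugify($name) {
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);  // "Björk" -> "Bjork" (best effort)
    $slug  = trim(strtolower(preg_replace('/[^A-Za-z0-9]+/', '-', $ascii)), '-');
    return $slug !== '' ? $slug : null;  // e.g. 毛泽东 has no ASCII fallback, so handle that case separately
}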
B. You limit the user's input at creation time to characters that are acceptable for passing in a URL (e.g. you ask for alphanumeric characters only):
(4)
mysite.com/user/bjorkguomundsdottir/articles
mysite.com/user/maozedong12/posts
using e.g. a regex check (for simplicity's sake):
// Allow Unicode letters, numbers, punctuation, spaces, and math/currency symbols; reject anything else.
if (!preg_match('/^[\p{L}\p{N}\p{P}\p{Zs}\p{Sm}\p{Sc}]+$/u', trim($sUserInput))) {
    //...
}
C. You escape them in full using PHP rawurlencode(), and get the ugly output as in (2).
Question:
I want to focus on B and push it as far as possible within KNOWN errors/concerns, until we get beautiful URLs like (1). I found that passing many Unicode characters in URLs is possible in modern browsers: they automatically convert Unicode or otherwise non-parseable characters into encoded characters, so the user can, for example, copy and paste the nice-looking Unicode URLs in (1) and the browser will get the actual final URL right.
For some characters, the browser will not get it right without encoding: e.g. ?, #, / or \ will definitely break the URL.
So: which characters in the (non-alphanumeric) ASCII range can we allow at creation time, across the entire Unicode spectrum, to be injected into a URL without escaping? Or better: which groups of Unicode characters can we allow, and which characters are definitely always blacklisted? There will be special cases: spaces look fine, except at the end of the string, where they could be mis-selected. Is there a reference out there that shows which browsers interpret which Unicode character ranges correctly?
PS: I am well aware that using improperly encoded strings in URLs will almost never provide a security guarantee. This question is certainly not recommended practice, but I do not see the difference between asking it and the everyday matter of copying a URL from a website and pasting it into the browser without thinking through whether that URL was correctly encoded (a novice user wouldn't). Has someone looked at this before, and what was their code (regex, conditions, if statements...) solution?
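Not an authoritative whitelist, but as a starting point for B, here is a sketch that allows Unicode letters, digits, combining marks and spaces (plus dot, underscore and hyphen) and rejects every other ASCII character, so ?, #, /, \, %, quotes and the like are refused rather than escaped:

function isSafeForPrettyUrl($name) {
    $name = trim($name);   // also takes care of the trailing-space case
    // preg_match with the /u modifier fails on invalid UTF-8, which is thereby rejected too.
    return $name !== ''
        && preg_match('/^[\p{L}\p{N}\p{M}\p{Zs}._-]+$/u', $name) === 1;
}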
I recently learned that overlong encodings cause a security risk when not properly validated. From the answer in the previously mentioned post:
For example the character < is usually represented as byte 0x3C, but
could also be represented using the overlong UTF-8 sequence 0xC0 0xBC
(or even more redundant 3- or 4-byte sequences).
And:
If you take this input and handle it in a Unicode-oblivious byte-based
tool, then any character processing step being used in that tool may
be evaded.
Meaning that if I use htmlspecialchars on a string that uses overlong encoding, the output could still contain tags. I also assume that you could post similar characters (like " or ;) that could be used for SQL injection.
Perhaps it is just me, but I believe this is a security risk that relatively few people take into account or even know about. I've been coding for years and am only now finding this out.
Anyway, my question is: what tools can I use to send data with overlong encodings? People who are familiar with this risk: how do you perform tests on websites? I want to POST a bunch of overlong characters to my sites, but I have no idea how to do this.
In my situation I mostly use PHP and MySQL, but what I really want to know are testing tools, so I guess the back-end situation does not matter much.
I want to POST a bunch of overlong characters to my sites, but I have no idea how to do this.
Apart from testing with manual request tools like curl, a simple workaround for in-browser testing is to override the encoding of the form submission. Using e.g. Firebug or the Chrome developer tools, alter the form you're testing to add the attribute:
accept-charset="iso-8859-1"
You can now type characters that, when encoded as Windows code page 1252, become the overlong UTF-8 byte sequence you want.
For example, enter cafÃ© into the form and you will get the byte sequence c a f 0xC3 0xA9, so the application will think you typed café. Enter À¼foo and the sequence 0xC0 0xBC f o o will be submitted, which could be interpreted as <foo. Note that you won't see <foo in any output page source, because modern browsers don't parse overlong UTF-8 sequences in web pages, but you might get a �foo or some other indication that something isn't right.
For more in-depth access to doctor the input and check the output of a webapp, see dedicated sec tools like Burp.
To test whether your site is vulnerable, use curl to request your page with a POST, sending data that has been encoded as overlong UTF-8 (you could prepare such a payload in a file beforehand, so that the text you POST with curl, and which the PHP file receives, contains the overlong sequences).
http://php.net/manual/en/function.curl-setopt.php
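For example, a PHP curl sketch that POSTs a raw overlong sequence (the URL and field name are made up; 0xC0 0xBC is the overlong form of '<' quoted earlier):

// Send an overlong-encoded '<' to a test page and inspect how it is handled.
$overlong = "\xC0\xBC" . "script";                   // overlong '<' followed by "script"
$ch = curl_init('http://localhost/test.php');        // hypothetical test URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['comment' => $overlong]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
// Check $response and whatever was stored server-side to see whether the bytes got past your validation.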
I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, which it replaces with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese, and similar languages? And if replacing is not possible because a mapping cannot easily be determined, how can I remove all those characters? Of course I could first sanitize it as above and then remove all "non-Latin" characters, but maybe there is a better solution?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was initially in English, German and Russian. Later, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all non-ASCII characters and could return blank (invalid) clean URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to try replacing those characters, which, as stated in the question and confirmed in the comments, is of course not possible. Maybe somebody will answer that in all modern browsers (starting with IE8) this is no longer an issue; I would be glad to hear about that too.
Taking Japanese as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable rendering of the original characters. However, translating something into romaji requires knowing the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, only worse. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is no single romanization method; these languages usually have several, used by different people (Japanese, for example, has two common romanizations).
So it really depends on the language you are working with; while you might be able to make it work for some languages, another problem is detecting which language you are actually dealing with (e.g. Japanese and Chinese share a lot of characters, but their meanings, pronunciations, and hence romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that much work and processing time.
Maybe you should work in a different direction: simply make your file names work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be much easier to filter those out and otherwise support Unicode file names.
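A sketch of that direction in PHP, keeping the Unicode name and stripping only the characters listed above plus control characters (the 'untitled' fallback is just an example default):

function sanitizeFilename($name) {
    $name = preg_replace('/[*|\\\\\/:"<>?]/', '', $name);  // strip *|\/:"<>?
    $name = preg_replace('/[\x00-\x1F]/', '', $name);      // strip control characters
    $name = trim($name);
    return $name === '' ? 'untitled' : $name;              // fallback if nothing is left
}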
You could run it through your existing sanitizer; then anything that is not Latin you could convert to punycode.
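Taking the punycode suggestion literally, a sketch using the intl extension might look like this; note that idn_to_ascii() is designed for domain labels, so it fails on spaces, very long strings and some punctuation, and this is a rough idea rather than a drop-in converter:

function toAsciiLabel($name) {
    if (preg_match('/^[A-Za-z0-9-]+$/', $name)) {
        return $name;                                         // already plain ASCII
    }
    $puny = idn_to_ascii($name, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $puny !== false ? $puny : $name;                   // e.g. "北京" -> "xn--1lq90i"
}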
So, as I understand it, you need a character mapping table for each language, and you replace characters according to that table.
For example, to transliterate Russian characters into Latin equivalents, we use such tables (or classes that wrap them).
Interestingly, I just found this: http://derickrethans.nl/projects.html#translit
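A sketch of the table idea in PHP (the Cyrillic map here is deliberately tiny and incomplete, and iconv's //TRANSLIT output depends on the locale):

$russianToLatin = ['ж' => 'zh', 'ч' => 'ch', 'ш' => 'sh', 'щ' => 'shch', 'ю' => 'yu', 'я' => 'ya'];
function transliterate($text, array $table) {
    $text = strtr($text, $table);                              // language-specific rules first
    return iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);   // generic fallback for the rest
}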
I've looked across the web, through SO, through the PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you receive a string in an unknown character set and it contains strange characters (like curly 'English' quotes), is there a standard way to convert it to UTF-8?
I've seen many messy solutions using a plethora of functions and checks, and none of them is guaranteed to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something to that effect. I understand that now, but none has offered any solution that works besides utf8_encode, which is very limited. What methods ARE out there to deal with this? What is the best one?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason you see so many complicated solutions to this problem is that, by definition, it is not solvable: there is no deterministic way to recover the encoding of a string of text from its bytes.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) sent to the server. This is a valid EBCDIC string that represents the letters C and d. It is also a valid ANSI string that represents the two characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 -- because the operating system knows that I am pasting ANSI (I copied the text from Textpad, where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because the original encoding cannot be reliably determined from the bytes alone, you cannot expect a trivial method to recover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded a text file that contains the two bytes xC3 x84, with no other information -- just those two bytes. Such a method could find that the letter Ä is fairly common in German text, while the characters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
function forceToUtf8($string) {
    // Bail out if the string is not valid in mbstring's internal encoding.
    if (!mb_check_encoding($string)) {
        return false;
    }
    // mb_detect_encoding() only guesses, so the result is best effort.
    return mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));
}
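A usage sketch, with $rawTitle standing in for any input whose charset is genuinely unknown; since mb_detect_encoding() only guesses, treat the result as best effort:

$title = forceToUtf8($rawTitle);
if ($title === false) {
    $title = '';   // or reject the input outright
}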
If I'm not wrong, there is a function called utf8_encode()... it works well EXCEPT if the string is already UTF-8 (it assumes ISO-8859-1 input).
http://php.net/manual/en/function.utf8-encode.php
I have done a lot of PHP programming in recent years, and one thing that keeps annoying me is the weak support for Unicode and multibyte strings (to be sure, natively there is none). For example, htmlentities seems to be a much-used function in the PHP world, and I find it absolutely annoying when you've put effort into keeping every string localizable, storing only UTF-8 in your database, delivering only UTF-8 web pages, and so on. Suddenly, somewhere between your database and the browser, there's this hopelessly naive function pretending every byte is a character and messing everything up.
I would just love to dump this kind of function; they seem totally superfluous. Is it still necessary these days to write '&auml;' instead of 'ä'? At least my Firefox seems perfectly happy to display even the strangest Asian glyphs, as long as they're served in a proper encoding.
Update: To be more precise: are named entities necessary for anything other than escaping HTML markup (as in "&lt;" for "<")?
Update 2:
@Konrad: Are you saying that, no, named entities are not needed?
@Ross: But wouldn't it be better to sanitize user input when it's entered, to keep my output logic free from such issues? (Assuming, of course, that reliable sanitizing on input is possible -- but then, if it isn't, can it be done on output?)
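(On the htmlentities complaint in the question: the usual UTF-8-safe output escaping is sketched below, with $userText as a hypothetical string already stored as UTF-8; only the markup-significant characters are escaped and multibyte sequences pass through untouched.)

echo htmlspecialchars($userText, ENT_QUOTES, 'UTF-8');   // escapes & < > " ' only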
Named entities in "real" XHTML (i.e. served as application/xhtml+xml, rather than the more frequently used text/html compatibility mode) are discouraged. Aside from the five defined in XML itself (&lt;, &gt;, &amp;, &quot;, &apos;), they would all have to be defined in the DTD of the particular DocType you're using. That means your browser has to explicitly support that DocType, which is far from a given. Numeric entities, on the other hand, map directly to Unicode code points, so they only require a simple lookup to get the right character.
As for whether you need entities at all these days: you can pretty much expect any modern browser to support UTF-8. Therefore, as long as you can guarantee that the database, the markup and the web server all agree to serve that, ditch the entities.
If using XHTML, it's actually recommended not to use named entities ([citation needed]). Some browsers (Firefox …), when parsing this as XML (which they normally don't), don't read the DTD files and thus are unable to handle the entities.
As it's best practice anyway to use UTF-8 as the encoding unless there are compelling reasons to do otherwise, this only means that the creator of the documents needs a decent editor that can not only handle the documents but also provide a good way of entering the diverse glyphs. OS X doesn't really have this problem, because most needed glyphs can be reached via “alt” keys, but Windows doesn't have this feature.
@Konrad: Are you saying that, no, named entities are not needed?
Precisely. Unless, of course, there are silly restrictions, e.g. legacy database drivers that choke on UTF-8 etc.
Safari seems to have issues with some glyphs but not others; it may not be needed, but it's probably best to do so. Of course, this is only my opinion, backed by nothing but my own observations.