Non-Latin characters in friendly URLs for search purposes - PHP

Context: I want to allow non-Latin characters in my URLs.
Why: The search term would be part of the URL. Example: example.tld/search-term
Facts: Only modern browsers display the decoded characters; internally they must still use percent-encoding. Yet some sites, like Wikipedia, use non-Latin characters in their URLs.
Question:
What should I do? Which problems could I run into by allowing search terms to be passed that way? Do I need to do anything special to retrieve the term in my PHP file? Any URL encoding function?
Thanks for your time :D
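For reference, a minimal sketch of one way to recover such a term in PHP, assuming the term is the whole path as in the example above (example.tld/search-term):
// The browser sends example.tld/caf%C3%A9 even if the user typed example.tld/café.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$term = rawurldecode(ltrim($path, '/')); // "café" again, as a UTF-8 string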

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick to non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters can appear anywhere: in document titles, in tags, in user names, etc., so they are not always under the complete supervision of the site's maintainer.
A possible approach, of course, would be to set up alternate -- unaccented -- URLs as well, pointing to the original destination, but I would like to hear your opinions on using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works, too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really just 'visual sugar' on the part of the browser: in a browser that doesn't support IDN or Unicode, the decoded (non-ASCII) version won't work, because the underlying definition of URLs simply doesn't support it. So, for things to work consistently, you need to percent-encode.
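A short PHP illustration of both mechanisms (idn_to_ascii() requires the intl extension):
// Percent-encoding for path data:
echo rawurlencode('café');     // caf%C3%A9
// Punycode for the host name:
echo idn_to_ascii('café.com'); // xn--caf-dma.com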
When faced with a similar problem, I took advantage of URL rewriting to make such pages accessible via either the accented or the unaccented spelling. The actual URL would be something like
http://www.mysite.com/myresume.html
and a rewriting-plus-character-translation function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So, to answer your question: as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
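A hypothetical sketch of such a translation step (the normalization rules here are illustrative only, and iconv's //TRANSLIT output is locale-dependent):
// Map an incoming, possibly accented path to its canonical ASCII form,
// e.g. /myresumé.html -> /myresume.html.
$path  = rawurldecode(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $path);
$ascii = preg_replace('/[^A-Za-z0-9\/._-]/', '', $ascii); // drop stray marks like ' or ~
// ...then internally rewrite or redirect to $ascii (validation omitted here).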
Considering URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, btw; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look as nice as they could; but, still, people are used to them, and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that users may enter into the browser manually. They are OK for embedded links pre-encoded by the server.
We found that browsers can encode the URL in different ways, and it's very hard to figure out which encoding is used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Name) rules, and can contain (most of) the Unicode characters.
The path (after the first /), the user name and the password can again be anything. They are escaped (as %XX), but those are just bytes; what encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/parameters be UTF-8, but this is not mandatory, and not everyone respects it.
So you should definitely allow non-ASCII (we are not in the '80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.). A short illustration of the path vs. parameters distinction follows below.
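As a concrete PHP illustration (the domain and search term are placeholders):
$term = 'crème brûlée';
// Path segment: rawurlencode() uses %20 for spaces (RFC 3986 style).
echo 'https://example.tld/' . rawurlencode($term);
// https://example.tld/cr%C3%A8me%20br%C3%BBl%C3%A9e
// Query string: http_build_query() uses + for spaces (form style).
echo 'https://example.tld/search?' . http_build_query(['q' => $term]);
// https://example.tld/search?q=cr%C3%A8me+br%C3%BBl%C3%A9e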

Problem with words written with other characters

On my site I have a problem with spammers who write URLs using other characters. I have implemented a blacklist of words and a URL detector, but I have a problem with words written like '𝐰𝐰𝐰.3𝐬𝐞𝐱.𝐱𝐲𝐳' or '𝐛𝐮𝐦𝐞𝐧.𝐩𝐰' or '𝐰𝐰𝐰.𝐭𝐲𝐭𝐞𝐬.𝐢𝐧𝐟𝐨'. Apparently the Latin letters are replaced by look-alike signs from elsewhere in the UTF-8 table.
And this is my question:
Is there a library in PHP, or some other way, to translate these UTF-8 characters to normal letters?
For example, when we type such a modified URL into the Chrome browser, the browser automatically translates it to normal letters.
Why not check the URL for well-known, accepted characters and reject the query if there are others?
This is fairly easy to write and is more future-proof.
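What Chrome is doing there is essentially Unicode compatibility normalization. A sketch of the same idea in PHP, assuming the intl extension is available:
// NFKC folds compatibility look-alikes back to plain letters,
// e.g. mathematical bold 𝐰 (U+1D430) becomes w.
$spam   = '𝐰𝐰𝐰.𝐭𝐲𝐭𝐞𝐬.𝐢𝐧𝐟𝐨';
$folded = Normalizer::normalize($spam, Normalizer::FORM_KC);
echo $folded; // www.tytes.info
// Run the blacklist and URL detector on $folded instead of the raw input,
// then reject anything that still contains characters outside the accepted set.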

What Unicode character groups should we limit the user to, to create Beautiful URLs?

I recently started looking at adding untrusted usernames into prettied URLs, e.g.:
mysite.com/
mysite.com/user/sarah
mysite.com/user/sarah/article/my-home-in-brugge
mysite.com/user/sarah/settings
etc..
Note the username 'sarah' and the article name 'my-home-in-brugge'.
What I would like to achieve is that someone could just copy-paste the following URLs somewhere:
(1)
mysite.com/user/Björk Guðmundsdóttir/articles
mysite.com/user/毛泽东/posts
...and it would just be very clear, before clicking on the link, what to expect to see. The following two URLs are exactly the same, but with the usernames encoded using PHP's rawurlencode() (considered the proper way of doing this):
(2)
mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
mysite.com/user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
...are a lot less clear.
There are three ways to securely (to some level of guarantee) pass an untrusted name containing readable UTF-8 characters into a URL path as a directory:
A. You reparse the string into allowable characters while still keeping it uniquely associated in your database with that user, e.g.:
(3)
mysite.com/user/bjork-guomundsdottir/articles
mysite.com/user/mao-ze-dong12/posts
B. You limit the user's input at string-creation time to characters acceptable for URL passing (you ask, e.g., for alphanumeric characters only):
(4)
mysite.com/user/bjorkguomundsdottir/articles
mysite.com/user/maozedong12/posts
using e.g. a regex check (for simplicity's sake):
// Accept Unicode letters, numbers, punctuation, spaces and math/currency symbols:
if (!preg_match('/^[\p{L}\p{N}\p{P}\p{Zs}\p{Sm}\p{Sc}]+$/u', trim($sUserInput))) {
    //...
}
C. You escape them in full using PHP rawurlencode(), and get the ugly output as in (2).
Question:
I want to focus on B, and push this as far as possible within KNOWN errors/concerns, until we get the beautiful URLs as in (1). I found out that passing many Unicode characters in URLs is possible in modern browsers. Modern browsers automatically convert Unicode characters, or characters a URL cannot otherwise carry, into encoded characters, allowing the user to e.g. copy-paste the nice-looking Unicode URLs as in (1), and the browser will get the actual final URL right.
For some characters, the browser will not get it right without encoding: e.g. ?, #, / or \ will definitely and clearly break the URL.
So: which characters in the (non-alphanumeric) ASCII range can we allow at creation time, across the entire Unicode spectrum, to be injected into a URL without escaping? Or better: which groups of Unicode characters can we allow? Which characters are definitely always blacklisted? There will be special cases: spaces look fine, except at the end of a string, where they could be mis-selected. Is there a reference out there that shows which browsers interpret which Unicode character ranges OK?
PS: I am very well aware that using improperly encoded strings in URLs will almost never provide a security guarantee. This question is certainly not recommended practice, but I do not see the difference between asking this question and the done-so-often matter of copy-pasting a URL from a website into the browser without thinking through whether that URL was correctly encoded or not (a novice user wouldn't). Has someone looked at this before, and what was their code (regex, conditions, if-statements...) solution?
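One conservative sketch of such a whitelist check (the chosen groups are an assumption for illustration, not a vetted standard: Unicode letters, numbers and combining marks plus spaces and hyphens, so the RFC 3986 reserved characters and '%' are excluded by construction):
function isBeautifulSegment(string $s): bool
{
    // No leading/trailing spaces, and only letters, numbers,
    // combining marks, spaces and hyphens.
    return $s === trim($s)
        && preg_match('/^[\p{L}\p{N}\p{M} -]+$/u', $s) === 1;
}
var_dump(isBeautifulSegment('Björk Guðmundsdóttir')); // bool(true)
var_dump(isBeautifulSegment('毛泽东'));                 // bool(true)
var_dump(isBeautifulSegment('foo/bar?baz'));           // bool(false)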

SEO Canonical URL in Greek characters

I have a URL which includes Greek letters:
http://www.mydomanain.com/gr/τιτλος-σελιδας/20/
I am using $_SERVER['REQUEST_URI'] to insert the value into the canonical link in my page head, like this:
<link rel="canonical" href="http://www.mydomanain.com<?php echo $_SERVER['REQUEST_URI']; ?>" />
The problem is that when I view the page source, the URL is displayed with characters like ...CE%B3%CE%B3%CE%B5%CE%BB..., but when clicking on it, it displays the link as it should be.
Will this cause any penalty from search engines?
No, this is the correct behaviour. All characters in URLs can be present in the page source in their human-readable form or in an encoded form which can be translated back using tables for the relevant character set. When the link is clicked, the encoded value is sent to the server, which translates it back to its human-readable form.
It is common to encode characters that may cause issues in URLs - spaces being a common example (%20); see ASCII tables. The %xx syntax refers to the equivalent HEX value of the character.
Search engines are aware of this and interpret the characters correctly.
When sending the HTML to the browser, ensure that the character set specified by the server matches your HTML. Search engines will also look for this to correctly decode the HTML. The correct way to do this is via HTTP response headers. In PHP these are set with header():
header('Content-Type: text/html; charset=utf-8');
// Change utf-8 to a different encoding if used
URLs can only consist of a limited subset of ASCII characters. You cannot, in fact, use "Greek characters" in a URL. All characters outside this limited ASCII range must be percent-encoded.
Now, browsers do two things:
If they encounter URLs in your HTML which fall outside this rule, i.e. which contain unencoded non-ASCII characters, the browser will helpfully encode them for you before sending off the request to your server.
For some (unambiguous) characters, the browser will display them in their decoded form in the address bar, to enhance the UX.
So, yeah, all is good. In fact, you should be percent-encoding your URLs yourself if they aren't already.
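A sketch of doing that yourself for the canonical link from the question (decoding first to avoid double-encoding, then re-encoding each path segment; the domain is the question's own placeholder):
$path     = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); // query string omitted
$segments = array_map('rawurlencode', explode('/', rawurldecode($path)));
$href     = 'http://www.mydomanain.com' . implode('/', $segments);
echo '<link rel="canonical" href="' . htmlspecialchars($href) . '" />';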

Converting UTF8 text for use in a URL

I'm developing an international site which uses UTF-8 to display non-English characters. I'm also using friendly URLs which contain the item name. Obviously I can't use the non-English characters in the URL.
Is there some sort of common practice for this conversion? I'm not sure which English characters I should replace them with. Some are quite obvious (like è to e), but others I am not familiar with (such as ß).
You can use UTF-8 encoded data in URL paths. You just need to encode it additionally with percent-encoding (see rawurlencode):
// ß (U+00DF) = 0xC39F (UTF-8)
$str = "\xC3\x9F";
echo '<a href="http://en.wikipedia.org/wiki/' . rawurlencode($str) . '">' . $str . '</a>';
This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß itself in the location bar instead of the percent-encoded representation of that character in UTF-8 (%C3%9F).
If you don't want to use UTF-8 but only ASCII characters, I suggest using transliteration, as Álvaro G. Vicario suggested.
I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:
último año
and produces output like:
'ultimo a~no
Then I use preg_replace() to replace whitespace with dashes:
'ultimo-a~no
... and remove unwanted chars, e.g.
[^a-z0-9-]
It's probably useless with Arabic or Chinese, but it works fine with Spanish, French or German.
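Put together, that recipe looks roughly like this (slugify() is an invented name, and iconv's //TRANSLIT output depends on the system locale, so results may vary):
setlocale(LC_CTYPE, 'en_US.UTF-8'); // //TRANSLIT is locale-dependent
function slugify(string $s): string
{
    $s = iconv('UTF-8', 'ASCII//TRANSLIT', $s); // último año -> 'ultimo a~no
    $s = strtolower(trim($s));
    $s = preg_replace('/\s+/', '-', $s);         // whitespace -> dashes
    return preg_replace('/[^a-z0-9-]/', '', $s); // drop leftovers like ' and ~
}
echo slugify('último año'); // ultimo-ano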
Obviously I can't use the non-English characters in the URL.
In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.
Notice that you need to encode the URL appropriately, as shown in the other answers.
Use rawurlencode to encode your name for the URL, and rawurldecode to convert the name in the URL back to the original string. These two functions convert strings to and from percent-encoded form in compliance with RFC 3986.
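For example, round-tripping one of the names from the earlier question:
$name = 'Björk Guðmundsdóttir';
$enc  = rawurlencode($name); // Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir
echo rawurldecode($enc);     // Björk Guðmundsdóttir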
Last time I tried (about a week ago), UTF-8 (specifically Japanese) characters worked fine in URLs without any additional encoding. They even looked right in the address bar across all browsers I tested with (Safari, Chrome and Firefox, all on Mac), and I have no idea what browser my girlfriend was using on Windows. Aside from most Windows installations I've run across showing just squares for Japanese characters because they lack the fonts required to display them, it seems to work fine there as well.
The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)
Proof by screenshot http://heavymetal.theredhead.nl/~kris/stackoverflow/screenshot-utf8-url.png
So it might not actually be allowed by the spec, but from what I've seen it works well across the board, except maybe in editors that like the spec a lot ;-)
I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".
