Converting UTF8 text for use in a URL - php

I'm developing an international site which uses UTF-8 to display non-English characters. I'm also using friendly URLs which contain the item name. Obviously I can't use the non-English characters in the URL.
Is there some sort of common practice for this conversion? I'm not sure which English characters I should replace them with. Some are quite obvious (like è to e) but there are other characters I am not familiar with (such as ß).

You can use UTF-8 encoded data in URL paths. You just need to additionally encode it with percent-encoding (see rawurlencode):
// ß (U+00DF) = 0xC39F (UTF-8)
$str = "\xC3\x9F";
echo '<a href="http://en.wikipedia.org/wiki/'.rawurlencode($str).'">'.$str.'</a>';
This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß itself in the location bar instead of the percent-encoded representation of that character in UTF-8 (%C3%9F).
If you don't want to use UTF-8 but only ASCII characters, I suggest using transliteration like Álvaro G. Vicario suggested.

I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:
último año
and produces output like:
'ultimo a~no
Then I use preg_replace() to replace white spaces with dashes:
'ultimo-a~no
... and remove unwanted chars, e.g.
[^a-z0-9-]
It's probably useless with Arabic or Chinese but it works fine with Spanish, French or German.
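Putting those steps together, a minimal sketch of such a slug function (the function name is my own; note that iconv's ASCII//TRANSLIT output depends on the C library and locale, so treat the results as an approximation):
setlocale(LC_CTYPE, 'en_US.UTF-8'); // TRANSLIT behaviour depends on locale

function slugify(string $text): string
{
    // Transliterate to ASCII, e.g. "último año" -> "'ultimo a~no"
    $text = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    $text = strtolower($text);
    // Replace whitespace runs with dashes
    $text = preg_replace('/\s+/', '-', $text);
    // Remove everything outside a-z, 0-9 and dashes
    $text = preg_replace('/[^a-z0-9-]/', '', $text);
    return trim($text, '-');
}

echo slugify('último año'); // ultimo-ano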

Obviously I can't use the non english characters in the URL.
In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.
Notice that you need to encode the URL appropriately, as shown in the other answers.

Use rawurlencode to encode your name for the URL, and rawurldecode to convert the name in the URL back to the original string. These two functions convert strings to and from URLs in compliance with RFC 3986.
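For example, a minimal round trip (the item name and URL are illustrative):
$name = 'último año';
$url  = 'http://example.com/item/' . rawurlencode($name);
echo $url; // http://example.com/item/%C3%BAltimo%20a%C3%B1o
echo rawurldecode('%C3%BAltimo%20a%C3%B1o'); // último año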

Last time I tried (about a week ago), UTF-8 (specifically Japanese) characters worked fine in URLs without any additional encoding. They even looked right in the address bar across all browsers I tested with (Safari, Chrome and Firefox, all on Mac), and I have no idea what browser my girlfriend was using on Windows. Aside from most Windows installations I've run across showing squares for Japanese characters because they lack the fonts required to display them, it seems to work fine there as well.
The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)
Proof by screenshot http://heavymetal.theredhead.nl/~kris/stackoverflow/screenshot-utf8-url.png
So it might not actually be allowed by the spec, but from what I've seen it works well across the board, except maybe in editors that like the spec a lot ;-)
I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs, risking that some words have completely different meanings with and without certain accents, or whether it is better to stick to non-English characters where appropriate, sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc., so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate, unaccented URLs as well, which would point to the original destination, but I would like to hear your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the encoded version won't work because the underlying definition of URLs simply doesn't support it, so for it to work consistently, you need to percent-encode.
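A quick way to see the punycode form for yourself, assuming PHP's intl extension is available:
// Requires ext-intl; converts an IDN to its punycode (ASCII) form
echo idn_to_ascii('café.com', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
// xn--caf-dma.com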
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting plus character-translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
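A minimal sketch of that normalization step in PHP (all names here are illustrative, and the lookup logic is omitted):
setlocale(LC_CTYPE, 'en_US.UTF-8'); // TRANSLIT behaviour depends on locale

$path  = rawurldecode(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $path);
// iconv may leave stray accent marks (e.g. "é" -> "'e"), so strip them
$key   = preg_replace('~[^A-Za-z0-9/._-]~', '', $ascii);
// "/myresumé.html" -> "/myresume.html"; use $key to look up the resource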
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as it seems accented URLs are now accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE6, at least, and there are still quite too many people using that one :-(
Maybe URLs without accents don't look the best they could; but, still, people are used to them, and seem to generally understand them quite well.
You should avoid non-ASCII characters in URLs that may be entered manually in the browser by users. It's OK for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding they use. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Names) rules, and can contain most of the Unicode characters.
The path (after the first /), the user name and the password can again be anything. They are escaped (as %XX), but those are just bytes; which encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/parameters be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages; tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
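A sketch of encoding each component separately in PHP (the URL and values are illustrative; note that http_build_query uses RFC 1738 encoding by default, so spaces become '+'):
// Path segments are encoded with rawurlencode, the query string
// with http_build_query.
$segment = 'último año';
$query   = ['q' => 'crème brûlée'];

$url = 'http://example.com/articles/' . rawurlencode($segment)
     . '?' . http_build_query($query);

echo $url;
// http://example.com/articles/%C3%BAltimo%20a%C3%B1o?q=cr%C3%A8me+br%C3%BBl%C3%A9e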

Non-Latin characters in friendly URLs for searching purposes

Context: I want to allow non-Latin characters in my URLs.
Why: The search term would be part of a URL. Example: example.tld/search-term
Facts: Only modern browsers show the decoded characters, because they MUST use percent-encoding for internal purposes. But some sites, like Wikipedia, use non-Latin characters in their URLs.
Question:
What should I do? Which problem(s) could I have by allowing search terms to be passed that way? Should I do something special to retrieve this term from my PHP file? Any URL-encoding function?
Thanks for your time :D
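For reference, a minimal sketch of reading such a term back on the PHP side (assuming the whole path after the host is the term):
// The browser sends the term percent-encoded (e.g. /%D1%82%D0%B5%D1%81%D1%82);
// decode it back to the original UTF-8 string.
$term = rawurldecode(ltrim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/'));
echo $term; // тест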

Sanitize/Replace all Japanese, Chinese, Korean, Russian etc. characters

I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, which it replaces with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because a mapping is not easy to determine, how can I remove all those characters? Of course I could first sanitize it as above and then remove all "non-Latin" characters. But maybe there is another good solution?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ASCII characters' and possibly returned 'blank' (invalid) clean URLs
the client found that in some browsers clean URLs with Chinese characters wouldn't work
The first point led me to try replacing those characters, which is of course, as stated in the question and confirmed in the comments, not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this is no longer an issue. I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is no single romanization method; those languages usually have different ones which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but their meanings, pronunciations and thus romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that much work and processing time into it.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to simply filter those out and otherwise support Unicode file names.
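A minimal sketch of that filtering approach (the sample name is illustrative):
// Strip only the characters that are invalid in common file systems,
// keeping Unicode letters intact
$name = '日本語: report <v2>?.txt';
$safe = preg_replace('~[*|\\\\/:"<>?]~u', '', $name);
echo $safe; // 日本語 report v2.txt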
You could run it through your existing sanitizer, and then convert anything non-Latin to punycode.
So, as I understand it, you need character mapping tables for every language, and you replace characters according to the table.
For example, to transliterate Russian characters into Latin equivalents, we use such tables =) Or classes which use these tables =)
Interestingly, I just found this: http://derickrethans.nl/projects.html#translit
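A minimal sketch of such a mapping table in PHP (only a few letters shown; a real table covers the whole alphabet):
// Hypothetical excerpt of a Russian-to-Latin transliteration table
$translit = [
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'ж' => 'zh', 'к' => 'k',
    'у' => 'u', 'ч' => 'ch', 'ш' => 'sh', 'щ' => 'shch', 'ю' => 'yu',
];
echo strtr('жучка', $translit); // zhuchka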

Weird character encoding in Firefox

I have a problem with character encoding in Firefox. When I copy/paste a paragraph from Microsoft Word (2007), it can contain special characters like these (dots/squares that make a list or a quote):
 Te’st
 Ze’f
• Gzg’a
The quote ’ is different from this quote ' (typed directly on the keyboard). So I paste this into a textarea and save (using AJAX in some cases). In the database (which has the collation latin1_swedish_ci) it shows perfectly fine. But when getting this data back for editing, Firefox shows weird binary symbols. It works fine in Chrome and IE.
I don't want to modify the charset of the database. Is there any way to solve this problem?
Note: you can also test by viewing this post in Chrome and FF
The characters you copy-pasted (assuming they got transmitted correctly into this forum) contain, in addition to letters, three occurrences of U+2019 RIGHT SINGLE QUOTATION MARK, which is the correct punctuation apostrophe in English and many other languages; one occurrence of U+2022 BULLET, which sounds OK; and two occurrences of U+F0A7, which is in the Private Use Area (PUA) and should not be used in public information exchange, only for special purposes by mutual agreement between interested parties.
It is possible that some notations in Word 2007 documents get converted to PUA characters in copy and paste, but at least a normal list bullet normally becomes U+2022 BULLET. So it is a bit of a mystery where the PUA characters come from.
Regarding the single quotes, they are representable in windows-1252 too, and latin1_swedish_ci seems to cover them (though it is, as far as I understand, just the definition of a collating order rather than a character encoding). And as you say the data looks fine in the database, it seems that the problem is in the way the data is written into the HTML document served to the browser.
In particular, if the encoding of the page in which the data is presented is UTF-8 and the actual data is there in windows-1252 encoding, problems arise. It would mean a problem like the one you describe, as U+2019 is encoded as 0x92 in windows-1252, and this causes a character-level data error when interpreted as UTF-8.
You can check the situation by using View→Encoding in Firefox when viewing the result page. If my hypothesis is correct, you will see UTF-8 selected there, and changing it to “West European (windows-1252)” makes the single quote appear (and may thoroughly mess up other things on the page).
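To see the mismatch concretely, here is a small sketch of the byte-level difference and the explicit conversion (assuming the stored data really is windows-1252):
// U+2019 stored as the single windows-1252 byte 0x92
$win1252 = "\x92";
// That byte is invalid as UTF-8; convert explicitly before serving the page
$utf8 = mb_convert_encoding($win1252, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8); // e28099, the UTF-8 encoding of U+2019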

Questions about iPhone emoji and web pages


Okay, so emoji basically show up as the above on a computer. Is that another programming language? So how do I put those little boxes into a PHP file? When I put them into a PHP file, they turn into question marks and whatnot. Also, how can I store these in MySQL without them turning into question marks and other weird things?
how do I put those little boxes into a php file?
Same way as any other Unicode character. Just paste them and make sure you're saving the PHP file and serving the PHP page as UTF-8.
When I put it into a php file, it turns into question marks and what not
Then you have an encoding problem. Work it out with Unicode characters you can actually see properly first, for example ąαд™日本, before worrying about the emoji.
Your PHP file should be saved as UTF-8; the page it produces should be served as Content-Type: text/html; charset=UTF-8 (or with a similar meta tag); the MySQL database should be using a UTF-8 collation to store the data; and PHP should be talking to MySQL using UTF-8.
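A minimal sketch of that setup (the connection details are placeholders; note that emoji live outside the Basic Multilingual Plane, so on modern MySQL you need the four-byte utf8mb4 charset rather than the three-byte utf8):
// Serve the page as UTF-8
header('Content-Type: text/html; charset=UTF-8');

// Talk to MySQL in UTF-8; utf8mb4 is required for 4-byte characters
// such as emoji
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8mb4');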
However. Even handling everything correctly like this, PCs will still not show the emoji. That's because:
they don't have fonts that include shapes for those characters, and
emoji are still completely unstandardised. Those characters you posted are in the Unicode Private Use Area, which means they don't have any official meaning at all.
Each network in Japan uses different character codes for their emoji, mapped to different areas in the PUA. So even on another mobile phone, it probably won't display the correct character, unless you spend ages manually converting emoji codes for different networks. I'm guessing the ones you posted above are from SoftBank (iPhone?).
There is an ongoing proposal led by Google and Apple to collate the different networks' emoji and give them a proper standardised place in Unicode. Until then, getting emoji to display consistently across networks is an exercise in unhappiness. See the character overview from the standardisation work to see how much converting you would have to do.
God, I hate emoji. All that pain for such a load of useless twee rubbish.
This has nothing to do with programming languages, just with encoding and fonts. As a very brief overview: Every character is stored by its character code (e.g.: 0x41 = A, 0x42 = B, etc), which is rendered as a meaningful character on your screen using a font (which says "the character with the code 0x41 should look like this ...").
These emoji occupy the "private use area" of the Unicode table, which is a range of codes that are undefined and free for anyone to use. That makes them perfectly valid character codes, it's just that no standard font has an appropriate character to display for them, since they are undefined. Only the iPhone and other handhelds, mostly in Japan, have appropriate icons for these codes. This is done to save bandwidth; instead of transmitting relatively large image files back and forth, emoji can be transmitted using a single character code.
As for how to store them: They should be storable as is, as long as you don't try to convert them to another encoding, in which case they may get lost. Just be aware that they only make sense on the iPhone and other SoftBank phones in Japan.
Character Viewer http://img.skitch.com/20091110-e7nkuqbjrisabrdipk96p4yt59.png
If you're on OS X you can copy and paste the character into the Character Viewer to find out what it is. I think there's a similar Character Map on Windows (albeit inferior ;-P). You could put it through PHP's ord(), but that only works on ASCII characters. See the discussion on the ord page for UTF-8 functions.
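On newer PHP (7.2+ with mbstring) there is a built-in way to get the code point; a small sketch:
// mb_ord returns the Unicode code point of the first character
printf("U+%04X\n", mb_ord('ß', 'UTF-8')); // U+00DF
printf("U+%04X\n", mb_ord('あ', 'UTF-8')); // U+3042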
BTW, just for the fun of it, these characters display fine on the iPhone as is, because the iPhone has a font which has icons for them:
iPhone http://img.skitch.com/20091110-bjt3tutjxad1kw4p9uhem5jhnk.png
I'm using FF3.5 and WinXP. I see little boxes in my browser, too.
This tells me the string uses characters for which my computer has no font installed.
When you put the string into a PHP file, the question marks tell you the same thing: your computer doesn't know how to display the characters.
You could store these emoji characters in MySQL if you encoded them differently, probably using UTF-8.
Do a web search for character encoding, as it relates to MySQL.
