My old page on Linux worked perfectly, but when I try to change the server from Unix to Windows, the characters no longer work.
My old skull character "☠" (decimal 9760) is shown as a box containing the hex digits 26 and 20.
The box with 26 and 20 indicates a lack of a glyph for the character U+2620 (decimal code 9760). This is one of the recommended ways of dealing with undisplayable characters according to the HTML 4.01 spec. So it indicates that the character has been properly recognized by the browser; it just cannot be displayed.
It sounds very odd that the OS of the server would affect this. It would be interesting to see the URLs of two versions that demonstrate such an effect. But changing the browser or the client computer can certainly have an effect.
This is not entirely a browser problem, though. “Undisplayable” is relative, because browsers (especially IE) may fail to render a character, even though there is a glyph for it in available fonts. Therefore, you may wish to use a font-family setting with a suitable list of alternatives; to check out font coverage, see
http://www.fileformat.info/info/unicode/char/2620/fontsupport.htm
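For illustration, a minimal sketch of that font-family fallback (my own example, not part of the original answer; the font names are only candidates, so check the page above for actual coverage):

<?php
// Serve the page as UTF-8 and give the browser several candidate fonts
// that may contain a glyph for U+2620.
header('Content-Type: text/html; charset=UTF-8');
?>
<span style="font-family: 'DejaVu Sans', 'Segoe UI Symbol', sans-serif;">&#9760;</span>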
When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick with non-English characters where appropriate, sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach, of course, would be to set up alternate -- unaccented -- URLs as well, which would point to the original destination, but I would like to hear your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the un-encoded version won't work, because the underlying definition of URLs simply doesn't support it. So for it to work consistently, you need to percent-encode.
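As a quick illustration of that distinction (my own sketch; idn_to_ascii() needs PHP's intl extension), the forms that actually go over the wire look like this:

<?php
// The percent-encoded and punycoded forms are what is actually transmitted;
// the pretty Unicode forms are the browser's sugar.
echo rawurlencode('café'), "\n";      // caf%C3%A9  (UTF-8 bytes, %XX-escaped)
echo idn_to_ascii('café.com'), "\n";  // xn--caf-dma.com  (IDN/punycode host)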
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting and character-translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
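A hedged sketch of that rewriting idea (not the answerer's actual code; it assumes a catch-all rewrite rule routes unresolved requests to a script like this):

<?php
// Fold accented letters in the requested path down to ASCII and serve that
// file instead, so /myresumé.html and /myresume.html hit the same resource.
$path  = urldecode(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $path);        // é -> e (locale-dependent)
$file  = preg_replace('/[^0-9A-Za-z.\-]/', '', basename($ascii)); // stay in the safe character set
readfile(__DIR__ . '/' . $file);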
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, not with %-escapes, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look the best they could; but still, people are used to them and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that users may enter manually in the browser. They're fine for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding is used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one might have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain most Unicode characters.
The path (after the first /), the user name, and the password can again be anything. They are escaped (as %XX), but those escapes represent raw bytes; which encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/arguments be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
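Putting those rules together, a rough sketch of assembling a URL part by part (my own example; idn_to_ascii() needs the intl extension):

<?php
// Each part gets its own treatment: IDN for the host, %XX-escaped UTF-8 for
// each path segment, and the same convention for the query string.
$host  = idn_to_ascii('bücher.example');
$path  = '/' . rawurlencode('éléphant');
$query = http_build_query(array('q' => 'café'));
echo "http://{$host}{$path}?{$query}";
// http://xn--bcher-kva.example/%C3%A9l%C3%A9phant?q=caf%C3%A9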
We get book titles from different sources (library systems), possibly in different encodings, but mostly UTF-8. These strings are shown on the web and exported to EndNote and RefWorks. RefWorks (a Windows citation system) does not accept any encoding other than ANSI.
In the RIS/RefWorks export, activating the line
$smarty = iconv("UTF-8", "Windows-1252", $smarty);
Example string
Diphosphen-komplexes (CO) 5CrPhPPPhCr(CO) 5
Everything after the first subscript character (the rectangles) is suddenly cut off. These characters are also not printed correctly in HTML, but that output is acceptable because nothing is cut off. With UTF-8 file encoding in the export, nothing is cut off either; despite that, the Windows software can't read UTF-8.
The simplest solution would be to convert every subscript number to a regular number; everything would then work quite well. But I could not find any simple way to do this; working with hex codes is the only thing I could imagine. This solution would also be preferred for use in our Solr index.
Does anybody know a better solution?
The example string contains Private Use code points such as U+E5F8. By definition, no standard assigns any meaning to them; their use is purely by private agreements. It is thus impossible to convert them to anything, or to do anything with them, without knowing or inferring the private agreements involved. Some systems use Private Use code points to represent some symbols that are assigned to those points in some special font. Knowing what that font is and inspecting it may thus help to find out the agreement.
The conversion would need to be coded separately, in an ad hoc manner, since there is an ad hoc agreement involved.
“ANSI”, which here means windows-1252, does not contain any subscript characters. In the context of a chemical formula, replacing subscript digits by normal digits does not change the meaning, and the formula is understandable, though it looks unprofessional.
When converting to HTML format (or another rich text format), you can use normal digits wrapped in elements that cause subscript rendering (or otherwise style them). HTML has the sub element for this, but its implementations differ between browsers and tend to be of poor quality, so a better approach is to generate <span class=sub>...</span> and use CSS to set the vertical position and font size.
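The fold-to-plain-digits replacement could look like the sketch below once the ad hoc mapping is known. It assumes the standard Unicode subscript digits U+2080..U+2089 (the Private Use points in the example string would need their own font-specific table), and reuses the $smarty variable from the question's iconv line:

<?php
// Fold subscript digits to plain digits first, then convert; //TRANSLIT//IGNORE
// stops iconv from truncating at the first inconvertible character.
$map = array(
    '₀' => '0', '₁' => '1', '₂' => '2', '₃' => '3', '₄' => '4',
    '₅' => '5', '₆' => '6', '₇' => '7', '₈' => '8', '₉' => '9',
);
$smarty = strtr($smarty, $map);
$smarty = iconv('UTF-8', 'Windows-1252//TRANSLIT//IGNORE', $smarty);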
About 2 years ago I made the mistake of starting a large website using iso-8859-1. I now am having issues with some characters, especially when sending data to the server using ajax. Because of this, I would like to switch to using UTF-8.
What issues do you see coming from this? I know I would have to search the site to look for characters that need to be changed from a ? to their real characters. But, are there any other risks in doing this? Has anyone done this before?
The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:
Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.
Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively — UTF-16 is next most common — and check that it's configured to use UTF-8 on its output to the web server.
Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?
There are at least three different places you can declare the charset for a web document. Be sure you change them all:
the HTTP Content-Type header
the <meta http-equiv="Content-Type"> tag in your documents' <head>
the <?xml> tag at the top of the document, if using XHTML Strict
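A minimal sketch touching all three at once (the XML declaration line only applies if you really serve XHTML as XML):

<?php
header('Content-Type: text/html; charset=UTF-8');     // the HTTP header
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n"; // the <?xml> declaration, echoed so PHP doesn't parse it
?>
<head>
  <!-- the meta tag, which matters if the page is saved to disk and reopened -->
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>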
All this comes from my experience a few years ago when I traced some Unicode data through a moderately complex N-tier app, and found conversion chains like:
Latin-1 → UTF-8 → Latin-1 → UTF-8
So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset common with Latin-1.
The biggest reason for those odd conversion chains was due to immature Unicode support in the tooling at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
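The same idea expressed with PHP's iconv() binding rather than the command-line tool (a sketch only; run it on a copy first, and make sure files that are already UTF-8 are excluded):

<?php
// Walk the document tree and rewrite every text file from Latin-1 to UTF-8.
$it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('/path/to/site'));
foreach ($it as $file) {
    if ($file->isFile() && preg_match('/\.(php|html|css|js|txt)$/', $file->getFilename())) {
        $latin1 = file_get_contents($file->getPathname());
        file_put_contents($file->getPathname(), iconv('ISO-8859-1', 'UTF-8', $latin1));
    }
}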
Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.
Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).
IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:
PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
The most common problem is treating a UTF-8 string as ISO-8859-1 and vice versa - you may need to add documentation to your functions stating what kind of encoding they expect, to make these sorts of errors less likely. A variable naming convention for your strings (to indicate what encoding they use) may also help.
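A tiny illustration of those points (assuming the script itself is saved as UTF-8 and the mbstring extension is available):

<?php
$s = "ä";                           // U+00E4: one character, two bytes in UTF-8
echo strlen($s), "\n";              // 2 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n";  // 1 (characters)
echo strstr("naïve", "ï"), "\n";    // "ïve"; byte-oriented strstr() still works on UTF-8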
Okay, so emoji basically show up as the above on a computer. Is that another programming language? So how do I put those little boxes into a PHP file? When I put it into a PHP file, it turns into question marks and whatnot. Also, how can I store these in MySQL without them turning into question marks and other weird things?
how do I put those little boxes into a php file?
Same way as any other Unicode character. Just paste them and make sure you're saving the PHP file and serving the PHP page as UTF-8.
When I put it into a php file, it turns into question marks and what not
Then you have an encoding problem. Work it out with Unicode characters you can actually see properly first, for example ąαд™日本, before worrying about the emoji.
Your PHP file should be saved as UTF-8; the page it produces should be served as Content-Type: text/html;charset=UTF-8 (or with a similar meta tag); the MySQL database should be using a UTF-8 collation to store data; and PHP should be talking to MySQL using UTF-8.
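A hedged sketch of those pieces together (the connection details and table name are made up for the example, and the table's collation is assumed to be UTF-8 already):

<?php
header('Content-Type: text/html; charset=UTF-8');  // serve the page as UTF-8
$emoji = "☺";                                      // stand-in for a pasted character
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8');                          // talk to MySQL in UTF-8
$stmt = $db->prepare('INSERT INTO messages (body) VALUES (?)');
$stmt->bind_param('s', $emoji);
$stmt->execute();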
However. Even handling everything correctly like this, PCs will still not show the emoji. That's because:
they don't have fonts that include shapes for those characters, and
emoji are still completely unstandardised. Those characters you posted are in the Unicode Private Use Area, which means they don't have any official meaning at all.
Each network in Japan uses different character codes for their emoji, mapped to different areas in the PUA. So even on another mobile phone, it probably won't display the correct character, unless you spend ages manually converting emoji codes for different networks. I'm guessing the ones you posted above are from SoftBank (iPhone?).
There is an ongoing proposal led by Google and Apple to collate the different networks' emoji and give them a proper standardised place in Unicode. Until then, getting emoji to display consistently across networks is an exercise in unhappiness. See the character overview from the standardisation work to see how much converting you would have to do.
God, I hate emoji. All that pain for such a load of useless twee rubbish.
This has nothing to do with programming languages, just with encoding and fonts. As a very brief overview: Every character is stored by its character code (e.g.: 0x41 = A, 0x42 = B, etc), which is rendered as a meaningful character on your screen using a font (which says "the character with the code 0x41 should look like this ...").
These emoji occupy the "private use area" of the Unicode table, which is a range of codes that are undefined and free for anyone to use. That makes them perfectly valid character codes, it's just that no standard font has an appropriate character to display for them, since they are undefined. Only the iPhone and other handhelds, mostly in Japan, have appropriate icons for these codes. This is done to save bandwidth; instead of transmitting relatively large image files back and forth, emoji can be transmitted using a single character code.
As for how to store them: They should be storable as is, as long as you don't try to convert them to another encoding, in which case they may get lost. Just be aware that they only make sense on the iPhone and other SoftBank phones in Japan.
Character Viewer (screenshot): http://img.skitch.com/20091110-e7nkuqbjrisabrdipk96p4yt59.png
If you're on OSX you can copy and paste the character into the Character Viewer to find out what it is. I think there's a similar Character Map on Windows (albeit inferior ;-P). You could put it through PHP's ord(), but that only works on ASCII characters. See the discussion on the ord page for UTF8 functions.
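For reference, a small helper of my own (not taken from the ord page) that does the ord()-style lookup for UTF-8, assuming the mbstring extension:

<?php
// Return the Unicode code point of the first character of a UTF-8 string.
// UCS-4BE stores each character as a 4-byte big-endian code point.
function utf8_ord($char) {
    $ucs4 = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
    $n = unpack('N', $ucs4);
    return $n[1];
}
printf("U+%04X\n", utf8_ord("☠"));   // U+2620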
BTW, just for the fun of it, these characters display fine on the iPhone as is, because the iPhone has a font which has icons for them:
iPhone (screenshot): http://img.skitch.com/20091110-bjt3tutjxad1kw4p9uhem5jhnk.png
I'm using FF3.5 and WinXP. I see little boxes in my browser, too.
This tells me the string uses characters for which no font installed on my computer has glyphs.
When you put the string into a PHP file, the question marks tell you the same thing: your computer doesn't know how to display the characters.
You could store these emoji characters in MySQL if you encoded them differently, probably using UTF-8.
Do a web search for character encoding, as it relates to MySQL.
I did a lot of PHP programming over the last few years, and one thing that keeps annoying me is the weak support for Unicode and multibyte strings (to be sure, natively there is none). For example, htmlentities seems to be a much-used function in the PHP world, and I found it absolutely annoying when you've put effort into keeping every string localizable, storing only UTF-8 in your database, delivering only UTF-8 web pages, etc. Suddenly, somewhere between your database and the browser, there's this hopelessly naive function pretending every byte is a character, and it messes everything up.
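To illustrate the problem (my own example; older PHP versions assume Latin-1 unless you pass the charset explicitly):

<?php
$s = "ä";                                         // two UTF-8 bytes: 0xC3 0xA4
echo htmlentities($s), "\n";                      // "&Atilde;&curren;" on a Latin-1 default: mangled
echo htmlentities($s, ENT_QUOTES, 'UTF-8'), "\n"; // "&auml;": correct once told the charset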
I would just love to dump this kind of function; they seem totally superfluous. Is it still necessary these days to write '&auml;' instead of 'ä'? At least my Firefox seems perfectly happy to display even the strangest Asian glyphs as long as they're served with a proper encoding.
Update: To be more precise: are named entities necessary for anything other than escaping HTML markup characters (as in "&lt;" for "<")?
Update 2:
#Konrad: Are you saying that, no, named entities are not needed?
#Ross: But wouldn't it be better to sanitize user input when it's entered, to keep my output logic free from such issues? (assuming of course, that reliable sanitizing on input is possible - but then, if it isn't, can it be on output?)
Named entities in "real" XHTML (i.e. served as application/xhtml+xml, rather than the more frequently used text/html compatibility mode) are discouraged. Aside from the five defined in XML itself (&lt;, &gt;, &amp;, &quot;, &apos;), they'd all have to be defined in the DTD of the particular DocType you're using. That means your browser has to explicitly support that DocType, which is far from a given. Numbered entities, on the other hand, obviously only require a lookup table to get the right Unicode character.
As for whether you need entities at all these days: you can pretty much expect any modern browser to support UTF-8. Therefore, as long as you can guarantee that the database, the markup and the web server all agree to serve that, ditch the entities.
If using XHTML, it's actually recommended not to use named entities ([citation needed]). Some browsers (Firefox …), when parsing this as XML (which they normally don't), don't read the DTD files and thus are unable to handle the entities.
As it's best practice anyway to use UTF-8 as the encoding if there are no compelling reasons to do otherwise, this only means that the creator of the documents needs a decent editor that can not only handle the documents but also provide a good way of entering the diverse glyphs. OS X doesn't really have this problem because most needed glyphs can be reached via "alt" keys, but Windows doesn't have this feature.
#Konrad: Are you saying that, no, named entities are not needed?
Precisely. Unless, of course, there are silly restrictions, e.g. legacy database drivers that choke on UTF-8 etc.
Safari seems to have issues with some glyphs but not others, so it may not be needed, but it's probably best to do so. Of course, this is just my opinion, not backed up by anything but my own observations.