What is the difference between UTF-8 and HTML entities?
The "A" you see here on screen is not actually stored as "A" in the computer, it's rather a sequence of 1's and 0's. A character set or encoding specifies a way to encode characters in such a way. The ASCII character set only includes a handful of characters it can encode, almost exclusively limited to characters of the English language. But for historical reasons and technical limitations of the time, it used to be the character set of the internet (very early on).
Both UTF-8 and HTML entities can be used to encode characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to certain sequences of characters; with them you can encode characters not covered by ASCII using only ASCII characters. UTF-8 (Unicode) does the same by simply extending the character set to include more characters. HTML entities are only "valid" in an environment that bothers to decode them, which is usually a browser. UTF-8 characters work universally in any application that supports the encoding.
Text containing only characters covered by ASCII:
Price: $20 (UTF-8)
Price: $20 (ASCII with HTML entities)
Text containing European characters not covered by ASCII:
Beträge: 20€ (UTF-8)
Betr&auml;ge: 20&euro; (ASCII with HTML entities)
Text containing Asian characters, most certainly not covered by ASCII:
値段:二千円 (UTF-8)
&#20516;&#27573;:&#20108;&#21315;&#20870; (ASCII with HTML entities)
The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.
The problem with HTML entities is that normal characters take on a special meaning. When writing &auml;, it takes on the special meaning of "ä". If you actually intend to write "&auml;", you need to double encode the sequence as &amp;auml;.
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they're a kludge bolted onto an inadequate character set. Use Unicode instead.
The important use of HTML entities that is independent of the character set used is to separate HTML markup from text. HTML, too, gives special meaning to special character sequences. <b>text</b> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intended to just write "<b>text</b>", you need to encode it as &lt;b&gt;text&lt;/b&gt; so the HTML parser doesn't mistake it for HTML tags.
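As a minimal PHP sketch (PHP is only used here for illustration; the answer itself is language-agnostic), htmlspecialchars does exactly this kind of escaping:

$markup = '<b>text</b>';
echo htmlspecialchars($markup, ENT_QUOTES, 'UTF-8');
// prints: &lt;b&gt;text&lt;/b&gt; -- which the browser then displays as the literal text "<b>text</b>"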
Think of UTF-8 as a means to map a list of natural numbers to a byte stream losslessly and self-synchronizingly: you can always get the natural numbers back (lossless), and if you land 'in the middle' of the stream, that's not a big problem (self-synchronizing).
Each natural number just happens to represent a 'character'.
HTML entities are a way to represent those same natural numbers: &#127;, for instance, stands for the natural number 127, which in Unicode is the DEL character.
In UTF-8 that's the bytestream: 0111 1111
Once you go above 127 it takes more than one octet; 128, for instance, becomes the two octets 1100 0010 1000 0000.
Two DEL chars in a row become 0111 1111 0111 1111. UTF-8 is designed in such a way that it is always possible to retrieve the original list of 'Unicode scalar values' from the byte stream, even though a byte stream of, say, 4 octets can map back to anywhere from 1 to 4 such scalar values. UTF-8 is thus 'variable length', as they call it.
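To illustrate the variable length (a sketch assuming PHP 7.2+ with the mbstring extension; PHP is not part of this answer, it is just used for demonstration):

echo bin2hex(mb_chr(0x7F, 'UTF-8')), "\n";     // 7f       -- DEL, one octet
echo bin2hex(mb_chr(0x80, 'UTF-8')), "\n";     // c280     -- two octets
echo bin2hex(mb_chr(0x20AC, 'UTF-8')), "\n";   // e282ac   -- the euro sign, three octets
echo bin2hex(mb_chr(0x1D11E, 'UTF-8')), "\n";  // f09d849e -- a character outside the BMP, four octets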
UTF-8 is a byte-level encoding scheme for Unicode.
HTML entities provide a way to express many characters within the standard (usually ASCII) character space. They also make those characters human-readable when UTF-8 is not available.
The main purpose of HTML entities today is to make sure text that looks like HTML renders as text. For example, the less-than and greater-than signs (< or >), when placed in a certain order (e.g. <text>), can accidentally be parsed as HTML when the intent was for them to render as text.
A ton. HTML entities are primarily there to escape HTML markup so it can be displayed as text in HTML (not mixing up display and markup). For instance, &gt; outputs a >, while > closes a tag. While you can produce full Unicode with HTML entities, it is very inefficient and downright ugly.
UTF-8 is a multi-byte encoding for Unicode, which covers how to display characters outside of the classic US-ASCII code page without resorting to switching and mixing code pages. A single code point (think of it as a character, though that is not truly correct) can take up to 4 bytes of data (the original design allowed up to 6, but the encoding is now restricted to 4). It can represent any character inside and outside of the Basic Multilingual Plane (BMP), such as accented characters, East Asian characters, and Celtic tree writing (Ogham), amongst other character sets.
UTF-8 is an encoding; htmlentities is a PHP function for making user input safe to display on the page, so that HTML tags are not injected directly into the markup. See the manual.
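For example (a minimal sketch; the flags shown are an assumption, not part of the original answer):

$input = 'ä < b';
echo htmlentities($input, ENT_QUOTES, 'UTF-8');
// prints: &auml; &lt; b -- tags and non-ASCII letters are both turned into entities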
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
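As an illustration only (the question isn't PHP-specific, and this sketch assumes PHP 7+ so the \u{...} string escape is available), PHP's json_encode shows both behaviours:

$clef = "\u{1D11E}";                                    // U+1D11E MUSICAL SYMBOL G CLEF
echo json_encode($clef), "\n";                          // "\ud834\udd1e" -- escaped as a UTF-16 surrogate pair
echo json_encode($clef, JSON_UNESCAPED_UNICODE), "\n";  // "𝄞" -- emitted directly as UTF-8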
A user on my site inputted special characters into a text field: ä ö
These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨
On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.
The character splitting happens there too, so I get a normal letter a and o plus a weird lone 0xCC byte that breaks the UTF-8 string encoding, and the json_encode function fails as a result.
What would be the best way to handle these characters? Should I try to replace the special ä ö chars and replace them with the regular ones or can I somehow catch the broken UTF-8 chars and remove or replace them?
It's not that these characters have broken the encoding, it's just that Unicode is really complicated.
Commonly used accented letters have their own code points in the Unicode standard, in this case:
U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"
However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:
U+0308 "COMBINING DIAERESIS"
When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
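As a small illustration in PHP (7+, so the \u{...} escape is available; the question above is PHP-based anyway), the two representations really are different strings:

$composed   = "\u{00E4}";   // precomposed U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
$decomposed = "a\u{0308}";  // "a" followed by U+0308 COMBINING DIAERESIS
var_dump($composed === $decomposed);                    // bool(false) -- different byte sequences
echo bin2hex($composed), ' vs ', bin2hex($decomposed);  // c3a4 vs 61cc88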
As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in an annex to the Unicode standard:
Normalization Form D (NFD): Canonical Decomposition
Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD): Compatibility Decomposition
Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
Ignoring the "Compatibility" forms for now, we have two options:
Decomposition, which uses combining diacritics as often as possible
Composition, which uses specific code points as often as possible
So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
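A minimal sketch, assuming the intl extension is installed:

$decomposed = "a\u{0308}";                                    // the decomposed form from the pasted input
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === "\u{00E4}");                                // bool(true) -- back to the precomposed letter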
However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
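For instance, the intl extension's grapheme_strlen() counts grapheme clusters rather than bytes or code points (again a sketch, assuming intl and mbstring are available):

$decomposed = "a\u{0308}";
var_dump(strlen($decomposed));              // int(3) -- bytes
var_dump(mb_strlen($decomposed, 'UTF-8'));  // int(2) -- code points
var_dump(grapheme_strlen($decomposed));     // int(1) -- grapheme clusters: one perceived letter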
I have a bunch of strings which I'm told have been encoded using the rawurlencode function in PHP.
Some of these strings contain percent encoded sequences for characters above unicode codepoint 127 - e.g. a%A0b.
I think the A0 in the above example is meant to represent a non-breaking space (Unicode code point 160, 0xA0), but A0 on its own is not a valid UTF-8 sequence (any byte with the high bit set (>127) is part of a multi-byte sequence). Thus .NET decodes this to ? by default.
I have tried a few different encodings. iso-8859-1 seems to fit, but I can't be sure.
This URL-encoded string will contain non-English characters, so it is critical that the conversion happens properly.
Which is the correct encoding to pass to System.Web.HttpUtility.ParseQueryString to decode a string that has been encoded with rawurlencode?
PHP's native string type is plain old bytes, with no encoding information attached. So rawurlencode doesn't do any handling of Unicode, it just hex-escapes each high byte to %xx.
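You can see that byte-oriented behaviour directly (a sketch; PHP 7+ for the \u{...} escape):

echo rawurlencode("\xA0"), "\n";      // %A0    -- a single high byte, as in the strings you received
echo rawurlencode("\u{00E4}"), "\n";  // %C3%A4 -- "ä" stored as two UTF-8 bytes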
If the application wants to treat those bytes as a representation of characters, it's up to the application to decide what encoding is in use. It would be lovely if the application told you that in the documentation, and it would be lovely if that encoding were UTF-8 which is the only sane choice. But apparently not.
iso-8859-1 seems to fit, but I can't be sure.
There are a lot of encodings that map character U+00A0 NO-BREAK SPACE to byte 0xA0, including all the ISO-8859s and all the Windows code pages that are based on them. True ISO-8859-1 is relatively rare on the web; you're more likely to meet its mutant cousin, the Windows Western code page 1252 (GetEncoding(1252)).
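If you want to sanity-check that guess from the PHP side (a hedged sketch assuming mbstring, and assuming the data really is Windows-1252), decode the bytes and reinterpret them:

$bytes = rawurldecode('a%A0b');                                // raw bytes: 61 a0 62
$text  = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
echo bin2hex($text), "\n";                                     // 61c2a062 -- "a", U+00A0 NO-BREAK SPACE, "b"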
The only way to tell would be to enter different characters into the application and see what comes out. What “non-English” characters are you expecting, any particular language?
I am writing a parser. I have taken care of all the encoding conversion to output UTF-8 correctly, but sometimes the source material is incorrect, such as ☐ or â€tm - the results of bad encoding conversions.
I know this is a long shot - but does anyone know of a list of common strings resulting from bad character conversions, or anything so I don't have to build my own list.
Yes I know I am being lazy, but I read somewhere that makes me a good programmer?
tl;dr: See last two paragraphs.
I hate/love encoding problems.
We're looking at a mutated copy of Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019). The byte sequence for that character is 0xE2 0x80 0x99. In Windows-1252, that corresponds to a+hat, Euro, and the trademark symbol (™). The 'tm' we see is a further transliteration of that trademark symbol into ASCII t and ASCII m, 0x74 0x6D, making our final corrupted sequence of bytes 0xE2 0x80 0x74 0x6D.
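A small PHP sketch (assuming PHP 7+ with mbstring; the answer below uses PHP anyway) reproduces the first step of that corruption:

$good = "\u{2019}";                                              // RIGHT SINGLE QUOTATION MARK, UTF-8 bytes e2 80 99
$mojibake = mb_convert_encoding($good, 'UTF-8', 'Windows-1252'); // reinterpret those raw bytes as Windows-1252
echo $mojibake, "\n";                                            // â€™ -- the familiar garbage, before the ™ gets turned into "tm"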
Chances are that the actual representation of a+hat-euro-t-m is already in UTF-8. That is, that a+hat is a UTF-8 sequence and the Euro symbol is also a UTF-8 sequence, because someone Copied from a Windows-1252 document that was already improperly encoded, and Pasted into a UTF-8 document. You'll find it's plenty more bytes than just the four from the original corruption.
One way to solve this would be to first turn the UTF-8 encoding of those characters back into Windows-1252, then treat that Windows-1252 string as UTF-8 when writing it back out.
You can use iconv with the //TRANSLIT flag for this purpose:
$less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $bad);
This tells iconv to try turning any characters that can't be represented in Windows-1252 into something similar. This translation is imperfect and will destroy any legitimate UTF-8 characters that aren't representable in Windows-1252.
Once you have the Windows-1252 string, save it back out and serve it up as UTF-8. If all went well, the corruption should be gone, and you shouldn't have any problems.
Yeah, right.
In this specific case, the final byte of the proper sequence, 0x99, has been munged into two bytes by a bad Copy/Paste. You aren't going to get it back through character set encoding hoop jumping.
While the hoop jumping could work for some documents, you will surely find many things that are even more poorly re-encoded. Your best bet is going to be conducting a byte-level search and replace operation, looking for incorrectly encoded sequences and replacing them with a plain-ASCII or properly UTF-8 encoded alternative. There are lots of ways that the encoding would be wrong. For example, if the corruption source was in the ISO-8859 family, the final corrupted sequence would have been different, or perhaps the final ™ might not be munched into t and m in certain places.
A byte-level search and replace is guaranteed only to impact incorrectly re-encoded sequences, and will not leave the risk of munching on single-encoded UTF-8 characters that can't be represented in inferior character sets. It's safer and faster.
edit: I totally didn't actually catch that you were already planning on doing this. ;) Unfortunately I've never seen such a handy list. Perhaps you should publish and publicize your work so that others may benefit. yourcharacterencodingsucks.com is available!
What is better for PHP developers - Unicode or UTF-8?
I am going to create an international CMS. So I am going to have clients all over the world. They will speak all possible languages.
What encoding format is better for browser recognition and for DB data storage?
"Unicode" is not an encoding. You may mean UTF-8 versus UTF-16 (big-endian or little-endian). It really doesn't matter much for browser support. Any modern browser will support all three. You will probably find UTF-8 is the most space-efficient for your database.
UTF-8 is an encoding of Unicode, a way of representing an (abstract) sequence of Unicode characters as a (concrete) sequence of bytes. There are other encodings, such as UTF-16 (which has both big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.
UTF-8 is useful if most of your text is in Western languages, since it represents ASCII characters in just one byte, but it needs three bytes each for many characters in "foreign" scripts such as Chinese. UTF-16, on the other hand, uses exactly two bytes for all characters you're likely to ever encounter (though some very esoteric characters, those outside Unicode's "Basic Multilingual Plane", require four).
I wouldn't recommend using PHP for developing international software, though, because it doesn't really properly support Unicode. It has some add-on functions for working with Unicode encodings (look at the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because that character takes up three bytes in UTF-8, even though it's only one character. Using string-splitting functions like substr() is precarious, because if you split in the middle of a multi-byte character you corrupt the string.
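A quick demonstration (mb_strlen and mb_substr come from the mbstring extension mentioned above):

$s = "大";                               // one character, three bytes in UTF-8
var_dump(strlen($s));                    // int(3) -- counts bytes
var_dump(mb_strlen($s, 'UTF-8'));        // int(1) -- counts characters
var_dump(mb_substr($s, 0, 1, 'UTF-8'));  // string(3) "大" -- safe, unlike substr($s, 0, 1)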
Most other languages used for Web development, such as Java, C#, and Python, have built-in support for Unicode, so that you can put arbitrary Unicode characters into a string and not need to worry about which encoding is used to represent them in memory because from your point of view a string contains characters, not bytes. This is a much safer, less-error-prone way to work with Unicode text. For this and other reasons (PHP isn't really that great a language), I'd recommend using something else.
(I've read that PHP 6 will have proper Unicode support, but that's not available yet.)
UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.
Microsoft recommends that "Developers should use UTF-8 for all Unicode data that they send to and receive from the browser."
For database storage, use the encoding your RDBMS has better support for. Or, all else being equal, choose based on space efficiency. UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.
Unicode is a standard which defines a bunch of abstract characters (so-called code points) and their properties (is it a digit, is it uppercase etc.). It also defines certain encodings (methods to represent characters with bytes), UTF-8 being one of them. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Spolsky for more details.
I would certainly go with UTF-8, it is the standard everywhere these days, and has some nice properties such as leaving all 7-bit ASCII characters in place, which means that most HTML-related functions such as htmlspecialchars can be used directly on the UTF-8 representation, so you have less chance of leaving encoding-related security holes. Also, a lot of PHP functions explicitly expect UTF-8 strings, and UTF-8 has better text editor support than alternatives like UTF-16, too.
It is better to use UTF-8, because it covers the accented characters of languages around the world. UTF-8 also leaves room for more characters to be added as they are assigned. I always prefer and use UTF-8.