To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
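As a quick PHP sketch of that point (assuming the source file itself is saved as UTF-8): both spellings decode to the same string, a five-hex-digit escape is rejected, and json_encode() itself produces the surrogate-pair form by default.

```php
<?php
// Both are valid JSON encodings of U+1D11E MUSICAL SYMBOL G CLEF.
$escaped = '"\ud834\udd1e"';   // twelve-character surrogate-pair escape
$literal = '"𝄞"';              // the character written directly in UTF-8

var_dump(json_decode($escaped) === json_decode($literal)); // bool(true)

// An escape with more than four hex digits is not valid JSON:
var_dump(json_decode('"\u{1D11E}"'));                      // NULL

// PHP's own encoder emits the spec-compliant surrogate pair:
echo json_encode("𝄞"), "\n";                               // "\ud834\udd1e"
```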
I have a bunch of strings which I'm told have been encoded using the rawurlencode function in PHP.
Some of these strings contain percent encoded sequences for characters above unicode codepoint 127 - e.g. a%A0b.
I think the A0 in the above example is meant to represent a non-breaking space (Unicode code point 160, 0xA0), but A0 on its own is not a valid UTF-8 sequence (any byte with the high bit set (>127) is part of a multi-byte sequence). Thus .NET decodes it to ? by default.
I have tried a few different encodings. iso-8859-1 seems to fit, but I can't be sure.
This URL encoded string will contain non-english characters so it is critical that the conversion happens properly.
Which is the correct encoding to pass to System.Web.HttpUtility.ParseQueryString to decode a string that has been encoded with rawurlencode?
PHP's native string type is plain old bytes, with no encoding information attached. So rawurlencode doesn't do any handling of Unicode; it just hex-escapes each high byte to %xx.
If the application wants to treat those bytes as a representation of characters, it's up to the application to decide what encoding is in use. It would be lovely if the application told you that in the documentation, and it would be lovely if that encoding were UTF-8 which is the only sane choice. But apparently not.
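A small sketch of that behaviour (the strings below are made up, chosen to mirror the a%A0b example from the question): rawurlencode() percent-escapes whatever bytes it is given, so the output depends entirely on how the character was turned into bytes beforehand.

```php
<?php
// U+00A0 NO-BREAK SPACE as bytes in two different encodings:
$nbsp_utf8 = "\xC2\xA0";   // UTF-8: two bytes
$nbsp_1252 = "\xA0";       // Windows-1252 / ISO-8859-1: one byte

echo rawurlencode("a" . $nbsp_utf8 . "b"), "\n"; // a%C2%A0b
echo rawurlencode("a" . $nbsp_1252 . "b"), "\n"; // a%A0b  (matches the question)
```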
iso-8859-1 seems to fit, but I can't be sure.
There are a lot of encodings that map the character U+00A0 NO-BREAK SPACE to byte 0xA0, including all the ISO-8859s and all the Windows code pages that are based on them. True ISO-8859-1 is relatively rare on the web; you're more likely to meet its mutant cousin, the Windows Western code page 1252 (GetEncoding(1252)).
The only way to tell would be to enter different characters into the application and see what comes out. What “non-English” characters are you expecting, any particular language?
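If the data does turn out to be a single-byte Western encoding, the round trip looks roughly like this in PHP terms (Windows-1252 is only an assumption here; substitute whatever encoding your tests reveal):

```php
<?php
// Decode the percent-escapes to raw bytes, then reinterpret those bytes
// under the assumed source encoding and convert them to UTF-8.
$raw  = urldecode('a%A0b');                    // bytes 0x61 0xA0 0x62
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);  // 0xA0 -> U+00A0 -> 0xC2 0xA0

echo bin2hex($utf8), "\n";                     // 61c2a062
```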
From time to time I encounter files that have a strange (wrong?) encoding of umlaut characters in their file names. Maybe the encoding comes from a Mac system, but I'm not sure. I work with Windows.
For example:
Volkszählung (where the "ä" is written as "a" plus a separate combining mark; try pressing Backspace after the first ä) instead of Volkszählung (with a precomposed "ä").
When pasting it into an ANSI encoded file with notepad++ it inserts Volksza¨hlung.
I have two questions:
a) Where does that come from and which encoding is it?
b) Using glob() in PHP does not list these files when using the wildcard character *. How is it possible to detect them in PHP?
That's a combining character: specifically, U+0308 COMBINING DIAERESIS. Combining characters are what let you put things like umlauts on any character, not just specific "precomposed" characters with built-in umlauts like U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Although it's not necessary to use a combining character in this case (since a suitable precomposed character exists), it's not wrong either.
(Note, this isn't an "encoding" at all: in the context of Unicode, an encoding is a method for transforming Unicode codepoint numbers into byte sequences so they can be stored in a file. UTF-8 and UTF-16 are encodings. But combining characters are Unicode codepoints, just like normal characters; they're not something produced by the encoding process.)
If you're working with Unicode text, you should be using PHP's mbstring functions. The built-in string functions aren't Unicode-aware, and see strings only as sequences of bytes rather than sequences of characters. I'm not sure how mbstring treats combining characters, though; the documentation doesn't mention them at all, as far as I can see.
You should also take a look at the grapheme functions, which are specifically meant to cope with combining characters. A "grapheme unit" is the single visual character produced by a base character codepoint plus any combining characters that follow it.
Finally, the PCRE regex functions support a \X escape sequence that matches whole grapheme clusters rather than individual codepoints.
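A short sketch of the difference between those three views of the same string, using the decomposed "ä" from the question (written here with PHP 7's \u{...} escape; the grapheme functions need the intl extension):

```php
<?php
// "a" followed by U+0308 COMBINING DIAERESIS -- renders as "ä".
$decomposed = "a\u{0308}";

var_dump(strlen($decomposed));              // int(3) -- bytes
var_dump(mb_strlen($decomposed, 'UTF-8'));  // int(2) -- code points
var_dump(grapheme_strlen($decomposed));     // int(1) -- one grapheme unit

// \X in PCRE matches a whole grapheme cluster:
preg_match_all('/\X/u', $decomposed, $m);
var_dump(count($m[0]));                     // int(1) -- the full "ä" cluster
```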
What is better for PHP developers - Unicode or UTF-8?
I am going to create an international CMS. So I am going to have clients all over the world. They will speak all possible languages.
What encoding format is better for browser recognition and for DB data storage?
"Unicode" is not an encoding. You may mean UTF-8 versus UTF-16 (big-endian or little-endian). It really doesn't matter much for browser support. Any modern browser will support all three. You will probably find UTF-8 is the most space-efficient for your database.
UTF-8 is an encoding of Unicode, a way of representing an (abstract) sequence of Unicode characters as a (concrete) sequence of bytes. There are other encodings, such as UTF-16 (which has both big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.
UTF-8 is useful if most of your text is in Western languages, since it represents ASCII characters in just one byte, but it needs three bytes each for many characters in non-Latin scripts such as Chinese. UTF-16, on the other hand, uses exactly two bytes for all characters you're likely to ever encounter (though some very esoteric characters, those outside Unicode's "Basic Multilingual Plane", require four).
I wouldn't recommend using PHP for developing international software, though, because it doesn't really properly support Unicode. It has some add-on functions for working with Unicode encodings (look at the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because that character takes up three bytes in UTF-8, even though it's only one character. Using string-splitting functions like substr() is precarious, because if you split in the middle of a multi-byte character you corrupt the string.
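A minimal sketch of that byte-versus-character mismatch (the sample string is arbitrary and assumed to be UTF-8):

```php
<?php
$s = "大阪";                             // two characters, six bytes in UTF-8

echo strlen($s), "\n";                   // 6 -- counts bytes
echo mb_strlen($s, 'UTF-8'), "\n";       // 2 -- counts characters

echo bin2hex(substr($s, 0, 2)), "\n";    // e5a4 -- half of "大", invalid UTF-8
echo mb_substr($s, 0, 1, 'UTF-8'), "\n"; // 大  -- splits on a character boundary
```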
Most other languages used for Web development, such as Java, C#, and Python, have built-in support for Unicode, so that you can put arbitrary Unicode characters into a string and not need to worry about which encoding is used to represent them in memory because from your point of view a string contains characters, not bytes. This is a much safer, less-error-prone way to work with Unicode text. For this and other reasons (PHP isn't really that great a language), I'd recommend using something else.
(I've read that PHP 6 will have proper Unicode support, but that's not available yet.)
UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.
Microsoft recommends that
Developers should use UTF-8 for all Unicode data that they send to and receive from the browser.
For database storage, use the encoding your RDBMS has better support for. Or, all else being equal, choose based on space efficiency. UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.
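A rough sketch of that size comparison, re-encoding a few arbitrary sample strings with mbstring (the inputs are assumed to be UTF-8):

```php
<?php
foreach (["Hello", "Größe", "値段"] as $s) {
    printf("%s  UTF-8: %d bytes, UTF-16: %d bytes\n",
        $s,
        strlen($s),                                          // byte length as UTF-8
        strlen(mb_convert_encoding($s, 'UTF-16LE', 'UTF-8'))
    );
}
// Hello  UTF-8: 5 bytes, UTF-16: 10 bytes
// Größe  UTF-8: 7 bytes, UTF-16: 10 bytes
// 値段  UTF-8: 6 bytes, UTF-16: 4 bytes
```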
Unicode is a standard which defines a bunch of abstract characters (so-called code points) and their properties (is it a digit, is it uppercase etc.). It also defines certain encodings (methods to represent characters with bytes), UTF-8 being one of them. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Spolsky for more details.
I would certainly go with UTF-8: it is the standard everywhere these days, and it has some nice properties, such as leaving all 7-bit ASCII characters in place. This means that most HTML-related functions such as htmlspecialchars can be used directly on the UTF-8 representation, so you have less chance of leaving encoding-related security holes. A lot of PHP functions also explicitly expect UTF-8 strings, and UTF-8 has better text-editor support than alternatives like UTF-16, too.
It is better to use UTF-8, because it covers the accented characters of languages around the world. UTF-8 also leaves room to represent additional characters as they are added to Unicode. I always prefer and use UTF-8.
Well, the subject says everything. I'm using json_encode to convert some UTF-8 data to JSON and I need to transfer it to some layer that is currently ASCII-only. So I wonder whether I need to make it UTF-8 aware, or whether I can leave it as it is.
Looking at the JSON RFC, UTF-8 is also a valid charset in JSON output, although not recommended, i.e. some implementations can leave UTF-8 data inside. The question is whether PHP's implementation dumps everything as ASCII or opts to leave something as UTF-8.
Unlike JSON support in other languages, json_encode() does not have the ability to generate anything other than ASCII.
According to the JSON article in Wikipedia, Unicode characters in strings are always
double-quoted Unicode with backslash escaping
The examples in the PHP Manual on json_encode() seem to confirm this.
So any UTF-8 character outside ASCII/ANSI should be escaped like this: \u00e4 (note, as @Ignacio points out in the comments, that this is the recommended way to deal with those characters, not a required one)
However, I suppose json_decode() will convert the characters back to their byte values? You may get in trouble there.
If you need to be sure, take a look at iconv() that could convert your UTF-8 String into ASCII (dropping any unsupported characters) beforehand.
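A quick sketch of what json_encode() actually emits for non-ASCII input and how json_decode() reverses it (the input is assumed to be valid UTF-8):

```php
<?php
$json = json_encode(["name" => "Müller"]);

echo $json, "\n";                    // {"name":"M\u00fcller"} -- pure ASCII
var_dump(json_decode($json, true));  // array(1) { ["name"]=> string(7) "Müller" }
```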
Well, json_encode returns a string. According to the PHP documentation for string:
A string is series of characters. Before PHP 6, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality.
So for the time being you do not need to worry about making it UTF-8 aware. Of course you still might want to think about this anyway, to future-proof your code.
What is the difference between UTF-8 and HTML entities?
The "A" you see here on screen is not actually stored as "A" in the computer; it's rather a sequence of 1's and 0's. A character set or encoding specifies the way characters are mapped to such bits. The ASCII character set only includes a handful of characters it can encode, limited almost exclusively to characters of the English language. But for historical reasons and technical limitations of the time, it used to be the character set of the internet (very early on).
Both UTF-8 and HTML entities can be used to encode characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to special sequences of characters. Using it you can encode characters not covered by ASCII using only ASCII characters. UTF-8 (Unicode) does the same by simply extending the character set to include more characters. HTML entities are only "valid" in an environment where you bother to decode them, which is usually a browser. UTF-8 characters are universal in any application that supports the character set.
Text containing only characters covered by ASCII:
Price: $20 (UTF-8)
Price: $20 (ASCII with HTML entities)
Text containing European characters not covered by ASCII:
Beträge: 20€ (UTF-8)
Betr&auml;ge: 20&euro; (ASCII with HTML entities)
Text containing Asian characters, most certainly not covered by ASCII:
値段:二千円 (UTF-8)
&#20516;&#27573;:&#20108;&#21315;&#20870; (ASCII with HTML entities)
The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.
The problem with HTML entities is that normal characters take on a special meaning. When writing &auml;, it takes on the special meaning of "ä". If you actually intend to write "&auml;", you need to double encode the sequence as &amp;auml;.
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they're a kludge bolted onto an inadequate character set. Use Unicode instead.
The important use of HTML entities that is independent of the character set used is to separate HTML markup from text. HTML, too, gives special meaning to special character sequences. <b>text</b> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intended to just write "<b>text</b>", you will need to encode it as &lt;b&gt;text&lt;/b&gt;, so the HTML parser doesn't mistake it for HTML tags.
Think of UTF-8 as a means of mapping a list of natural numbers to a bytestream in a way that is lossless (you can always get the natural numbers back) and self-synchronizing (if you land 'in the middle' of the stream, that's not a big problem, because you can find the next character boundary).
Each natural number just happens to represent a 'character'.
HTML entities are a way to represent those same natural numbers in text: for example, &#127; stands for the natural number 127, which in Unicode is the DEL character.
In UTF-8 that's the bytestream: 0111 1111
Once you go above 127, a character takes more than one octet; 128, for example, becomes: 1100 0010 1000 0000.
Two DEL chars in a row become 0111 1111 0111 1111. UTF-8 is designed in such a way that it's always possible to retrieve the original list of 'Unicode scalar values' from the bytestream, even though a bytestream of, for instance, 4 octets can map back to anywhere between 1 and 4 such scalar values. UTF-8 is thus 'variable length', as they call it.
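A tiny sketch that prints those byte patterns (mb_chr() requires PHP 7.2 or later; the code points are the ones discussed above plus the G clef from the first question):

```php
<?php
// Encode single code points to UTF-8 and show each byte in binary.
foreach ([0x7F, 0x80, 0x1D11E] as $cp) {
    $bytes = str_split(mb_chr($cp, 'UTF-8'));
    $bits  = implode(' ', array_map(function ($byte) {
        return sprintf('%08b', ord($byte));
    }, $bytes));
    printf("U+%04X -> %s\n", $cp, $bits);
}
// U+007F -> 01111111
// U+0080 -> 11000010 10000000
// U+1D11E -> 11110000 10011101 10000100 10011110
```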
UTF-8 is a byte-level encoding scheme for Unicode.
HTML entities provide a way to express many characters within the standard (usually ASCII) character space. They also make those characters more human-readable when UTF-8 is not available.
The main purpose of HTML entities today is to make sure text that looks like HTML renders as text. For example, the less-than and greater-than signs (< or >), when placed in a certain order (e.g. <text>), can accidentally render as HTML when the intent was for them to render as text.
A ton. HTML entities are primarily there to escape HTML markup so it can be displayed in HTML (don't mix up display vs. output). For instance, &gt; outputs a >, while > closes a tag. While you can produce full Unicode with HTML entities, it is very inefficient and downright ugly.
UTF-8 is a multi-byte encoding for Unicode, which covers how to represent characters outside of the classic US-ASCII code page without resorting to switching code pages and attempting to mix them. A single code point (think of it as a character, though that is not truly correct) can take up to four bytes of data (the original design allowed up to six). It can represent any character inside or outside of the Basic Multilingual Plane (BMP), such as accented characters, East Asian characters, and Celtic tree writing (Ogham), among other scripts.
UTF-8 is an encoding; htmlentities is a PHP function for making user input safe to display on the page, so that HTML tags are not added directly to the markup. See the manual.
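To tie the two together, a small sketch of PHP's two entity functions applied to the same UTF-8 string (the sample text is made up):

```php
<?php
$s = "Beträge < 20€";

echo htmlspecialchars($s, ENT_QUOTES, 'UTF-8'), "\n";
// Beträge &lt; 20€           -- only markup-significant characters are escaped
echo htmlentities($s, ENT_QUOTES, 'UTF-8'), "\n";
// Betr&auml;ge &lt; 20&euro; -- non-ASCII characters become entities as well
```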