I'm looking increasingly into ensuring that PHP apps are multibyte-safe, which mostly involves replacing string manipulation functions with their equivalent mb_* functions.
However string concatenation is giving me pause for thought.
Some character encodings (such as UTF-16) can include a Byte Order Mark at the beginning. If you concatenated two UTF-16 strings it's possible you'd introduce a BOM into the resulting string at a location other than the beginning. I suspect that there are other encodings that can also include "header" information such that stitching two strings of the same encoding together would also be problematic. Is PHP smart enough to discard BOMs etc. when doing multibyte string concatenations? I suspect not, because PHP has traditionally only treated strings as a sequence of bytes. Is there a multibyte-safe equivalent to concatenation? I've not been able to find anything in the mbstring documentation.
Obviously it would never be safe to concatenate strings that are in different encodings so I'm not worrying about that for now.
PHP has traditionally only treated strings as a sequence of bytes
It still does. PHP has no concept of a character string, as it exists in other languages. Therefore, all strings are always byte strings and you will need to manually track which of them are binary strings, which are character strings and which encoding is being used. An effort to bring Unicode strings to PHP resulted in PHP 6, which was abandoned and never released. But then again, even languages which do have native character strings will not automatically do what you're asking for anyway.
Have a look at the Unicode FAQ about BOM, some of the information below directly comes from there.
If a byte order mark ends up in the middle of a string, Unicode dictates that it should be interpreted as ZERO WIDTH NON-BREAKING SPACE. I conclude that this shouldn't usually be an issue, so it's not so terrible to just ignore BOMs.
However, if that bothers you, my recommendation is as follows:
Try to avoid BOMs altogether and mark the data stream accordingly. For example, when using HTTP, set the encoding to UTF-16BE or UTF-16LE using headers.
Sanitize all inputs used by the application (user input, files loaded, ...) as early as possible, by removing these BOMs and converting the encoding. You may even want to use the Normalizer class. Use your favorite framework's features if available.
Use one and only one encoding internally. Use mb_internal_encoding() to set a default for all mb_*() functions.
When outputting strings, add any desired BOMs back to the strings if you must. Again, it's preferable to just mark the data stream correctly.
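The sanitizing step above can be sketched as a small helper. This is a minimal sketch, not a complete solution: the function name utf16_to_utf8 is my own, and it assumes PHP 8+ (for str_starts_with()) with the mbstring extension available.

```php
<?php
// Hypothetical input-boundary helper: strip a UTF-16 BOM and convert to
// UTF-8 immediately, so later concatenations cannot embed a stray BOM.
function utf16_to_utf8(string $raw): string
{
    if (str_starts_with($raw, "\xFF\xFE")) {
        // Little-endian BOM found; remove it before converting.
        $raw  = substr($raw, 2);
        $from = 'UTF-16LE';
    } elseif (str_starts_with($raw, "\xFE\xFF")) {
        // Big-endian BOM found; remove it before converting.
        $raw  = substr($raw, 2);
        $from = 'UTF-16BE';
    } else {
        // Per the Unicode FAQ, BOM-less UTF-16 is interpreted as big-endian.
        $from = 'UTF-16BE';
    }
    return mb_convert_encoding($raw, 'UTF-8', $from);
}
```

Once all inputs have passed through a boundary function like this, plain `.` concatenation is safe, because every string is BOM-free UTF-8.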
That said, note that concatenating multi-byte strings can cause a multitude of unexpected situations, the BOM in the middle of a string is just one of them. Issues may also arise when using bidirectional text where RTL or LTR code points in the first string being concatenated may affect text in the second string. Further, many issues may also arise when using other string operations, for example using mb_substr() on bidirectional text may also produce unexpected results. Text involving combining diacritical marks may also be problematic.
The PHP documentation says:
Of course, in order to be useful, functions that operate on text may have to make some assumptions about how the string is encoded. Unfortunately, there is much variation on this matter throughout PHP’s functions:
[... a few special cases are described ...]
Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data [...]
Source: https://www.php.net/manual/en/language.types.string.php
So naturally my question is: Where are these specifications that allow us to identify the encoding/charset associated to string arguments, return values, constants, array keys/values, ... for built-in functions/methods/data (e.g. array_key_exists, DOMDocument::getElementsByTagName, DateTime::format, $_GET[$key], ini_set, PDO::__construct, json_decode, Exception::getMessage() and many more)? How do composer package providers specify the encodings in which they accept/provide textual data?
I have been working roughly with the following heuristic: (1) never change the encoding of anything, (2) when forced to pick an encoding, pick UTF-8. This has been working for years but it feels very unsatisfactory.
Whenever I try to find an answer to the question, I only get search results relating to url encoding, HTML entities or explaining the interpretation of string literals (with the source file's encoding).
Strings in PHP are what other languages would call byte arrays, i.e. purely a raw sequence of bytes. PHP is not generally interested in what characters those bytes represent, they're just bytes. Only functions that need to work with strings on a character level need to be aware of the encoding, anything else doesn't.
For example, array_key_exists doesn't need to know anything about characters to figure out whether a key with the same bytes as the given string exists in an array.
However, mb_strlen for example explicitly tells you how many characters the string consists of, so it needs to interpret the given string in a specific encoding to give you the right number of characters. mb_strlen('漢字', 'latin1') and mb_strlen('漢字', 'utf-8') give very different results. There isn't a unified way in which these kinds of functions are made encoding-aware*; you will need to consult their manual entries.
* The mb_ functions in particular generally use mb_internal_encoding(), but other sets of functions won't.
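To make the byte-vs-character distinction concrete, here is a small illustration. It assumes the script file itself is saved as UTF-8 (so the literal holds UTF-8 bytes) and that mbstring is available:

```php
<?php
// The same six bytes, interpreted under two different encodings.
$s = '漢字'; // two characters, three UTF-8 bytes each

var_dump(strlen($s));                  // int(6)  raw byte count
var_dump(mb_strlen($s, 'UTF-8'));      // int(2)  two characters under UTF-8
var_dump(mb_strlen($s, 'ISO-8859-1')); // int(6)  one "character" per byte
```

The string itself never changes; only the interpretation the function applies to its bytes does.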
Functions like DateTime::format are looking for specific characters in the format string to replace by date values, e.g. d for the day, m for the month etc. You can generally assume that these are ASCII byte values it's looking for, unless specified otherwise (and I'm not aware of anything that specifies otherwise). So any ASCII compatible encoding will usually do.
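A quick demonstration of that ASCII-compatibility point: because DateTime::format() only scans for ASCII format bytes like d, m, and Y, any other bytes, including UTF-8 multibyte sequences, pass through untouched (the date value below is just an example):

```php
<?php
$d = new DateTime('2024-05-01');

// UTF-8 literal text survives intact; only ASCII format letters are replaced.
echo $d->format('Y年m月d日') . "\n"; // 2024年05月01日

// A backslash escapes an ASCII byte that would otherwise be a format char.
echo $d->format('\Y: Y') . "\n";     // Y: 2024
```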
For a lot more details, you may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
Often this can be found in the official documentation, e.g., the DOMDocument class has a property encoding (determined by XML declaration). As for methods that return strings, I recommend reading this
From time to time I encounter files that have a strange (wrong?) encoding of umlaut characters in their file names. Maybe the encoding comes from a Mac system, but I'm not sure. I work with Windows.
For example:
Volkszählung instead of Volkszählung (try to use Backspace after the first ä).
When pasting it into an ANSI encoded file with notepad++ it inserts Volksza¨hlung.
I have two questions:
a) Where does that come from and which encoding is it?
b) Using glob() in PHP does not list these files when using the wildcard character *. How is it possible to detect them in PHP?
That's a combining character: specifically, U+0308 COMBINING DIAERESIS. Combining characters are what let you put things like umlauts on any character, not just specific "precomposed" characters with built-in umlauts like U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Although it's not necessary to use a combining character in this case (since a suitable precomposed character exists), it's not wrong either.
(Note, this isn't an "encoding" at all: in the context of Unicode, an encoding is a method for transforming Unicode codepoint numbers into byte sequences so they can be stored in a file. UTF-8 and UTF-16 are encodings. But combining characters are Unicode codepoints, just like normal characters; they're not something produced by the encoding process.)
If you're working with Unicode text, you should be using PHP's mbstring functions. The built-in string functions aren't Unicode-aware, and see strings only as sequences of bytes rather than sequences of characters. I'm not sure how mbstring treats combining characters, though; the documentation doesn't mention them at all, as far as I can see.
You should also take a look at the grapheme functions, which are specifically meant to cope with combining characters. A "grapheme unit" is the single visual character produced by a base character codepoint plus any combining characters that follow it.
Finally, the PCRE regex functions support a \X escape sequence that matches whole grapheme clusters rather than individual codepoints.
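The three levels of counting (bytes, code points, graphemes) can be seen side by side on the file name from the question. This sketch assumes a UTF-8 source file with the mbstring and intl extensions loaded:

```php
<?php
// "Volkszählung" written with 'a' + U+0308 COMBINING DIAERESIS
$s = "Volksza\u{0308}hlung";

var_dump(strlen($s));             // int(14) bytes (U+0308 is 2 bytes in UTF-8)
var_dump(mb_strlen($s, 'UTF-8')); // int(13) code points (base letters + mark)
var_dump(grapheme_strlen($s));    // int(12) user-visible characters

// \X matches one whole grapheme cluster, so 'a' + U+0308 counts once.
preg_match_all('/\X/u', $s, $m);
var_dump(count($m[0]));           // int(12)
```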
I have a question concerning UTF-8 and htmlentities. I have two variables with greek text, both of them seem to be UTF-8 encoded (according to mb_detect_encoding()). When I output the two variables, they look exactly the same in the browser (also in the source code).
I was astonished when I realized, that a simple if($var1 == $var2) always failed although they seemed to be exactly the same. So I used htmlentities to see whether the html code would be the same. I was surprised when I saw that the first variable looked like this: Ï�κÏ�λοÏ� and the other one like this: ια&ro;. How can it be that two identical words with the same encoding (UTF-8) are nevertheless different? And how could I fix this problem?
Your first question was: How can it be that two identical words with the same encoding (UTF-8) are nevertheless different?
In this case, the encoding isn't really UTF-8 in both cases. The first variable is in "real" UTF-8, while in the second, greek characters are not really in UTF-8, but in ASCII, with non-ASCII characters (greek) encoded using something called a CER (Character Entity Reference).
A web browser and some too friendly "WYSIWYG" editors will render these strings as identical, but the binary representations of the actual strings (which is what the computer will compare) are different. This is why the equal test fails, even if the strings appear to be the same upon human visual inspection in a browser or editor.
I don't think you can rely on mb_detect_encoding to detect encoding in such cases, since there is no way of telling utf-8 apart from ASCII using CER to represent non-ASCII.
Your second question was: How could I fix this problem?
Before you can compare strings that may be encoded differently, you need to convert them to canonical form ( Wikipedia: Canonicalization ) so that their binary representation is identical.
Here is how I've solved it: I've implemented a handy function named utf8_normalize that converts just about any common character representation (in my case: CER, NER, iso-8859-1 and CP-1252) to canonical utf-8 before comparing strings. What you throw in there must to some extent be determined by which character representations are "popular" in the environment your software will operate in, but if you just make sure that your strings are in canonical form before comparing, it will work.
As noted in the comment below from the OP (phpheini), there also exists the PHP Normalizer class, which may do a better job of normalization than a home-grown function.
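A minimal version of such a comparison might look like the sketch below. The function name utf8_equal is my own, and the code assumes the intl extension (for Normalizer) is installed; it decodes character entity references first, then normalizes both sides to NFC before comparing:

```php
<?php
// Hypothetical comparison helper: decode entities, then compare in
// canonical form (NFC) so that e.g. '&alpha;' equals a literal 'α' and
// precomposed 'ä' equals 'a' + combining diaeresis.
function utf8_equal(string $a, string $b): bool
{
    $a = html_entity_decode($a, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $b = html_entity_decode($b, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return Normalizer::normalize($a, Normalizer::FORM_C)
        === Normalizer::normalize($b, Normalizer::FORM_C);
}

var_dump(utf8_equal('&alpha;&beta;', 'αβ')); // bool(true)
```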
Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).
That full-unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
One thing that might help with your fourth point, though, is the function overloading feature of the mbstring extension (quoting):

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled.

(Note that mbstring.func_overload was later deprecated in PHP 7.2 and removed in PHP 8.0, so it is no longer an option on current PHP versions.)
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
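The Content-Length example above makes this concrete. A small illustration (UTF-8 source file, mbstring assumed); the variable name stands in for any response body:

```php
<?php
$body = '大きい'; // 3 characters, 9 bytes in UTF-8

// A Content-Length header must count BYTES on the wire, so the plain
// byte-semantics strlen() is the correct tool here:
echo strlen($body) . "\n";             // 9  -> correct Content-Length

// Codepoint semantics would silently truncate the response:
echo mb_strlen($body, 'UTF-8') . "\n"; // 3  -> wrong for Content-Length
```

This is exactly why a single global "make everything multibyte" switch cannot do the right thing: which length is "right" depends on whether the operation is about bytes or about characters.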
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)
What is better for PHP developers - Unicode or UTF-8?
I am going to create an international CMS. So I am going to have clients all over the world. They will speak all possible languages.
What encoding format is better for browser recognition and for DB data storage?
"Unicode" is not an encoding. You may mean UTF-8 versus UTF-16 (big-endian or little-endian). It really doesn't matter much for browser support. Any modern browser will support all three. You will probably find UTF-8 is the most space-efficient for your database.
UTF-8 is an encoding of Unicode, a way of representing an (abstract) sequence of Unicode characters as a (concrete) sequence of bytes. There are other encodings, such as UTF-16 (which has both big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.
UTF-8 is useful if most of your text is in Western languages since it represents ASCII characters in just one byte, but it needs three bytes each for many characters in "foreign" alphabets such as Chinese. UTF-16, on the other hand, uses exactly two bytes for all characters you're likely to ever encounter (though some very esoteric characters, those outside Unicode's "Basic Multilingual Plane", require four).
I wouldn't recommend using PHP for developing international software, though, because it doesn't really properly support Unicode. It has some add-on functions for working with Unicode encodings (look at the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because that character takes up three bytes in UTF-8, even though it's only one character. Using string-splitting functions like substr() is precarious because if you split in the middle of a multi-byte character you corrupt the string.
Most other languages used for Web development, such as Java, C#, and Python, have built-in support for Unicode, so that you can put arbitrary Unicode characters into a string and not need to worry about which encoding is used to represent them in memory because from your point of view a string contains characters, not bytes. This is a much safer, less-error-prone way to work with Unicode text. For this and other reasons (PHP isn't really that great a language), I'd recommend using something else.
(I've read that PHP 6 will have proper Unicode support, but that's not available yet.)
UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.
Microsoft recommends that

Developers should use UTF-8 for all Unicode data that they send to and receive from the browser.
For database storage, use the encoding your RDBMS has better support for. Or, all else being equal, choose based on space efficiency. UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.
Unicode is a standard which defines a bunch of abstract characters (so-called code points) and their properties (is it a digit, is it uppercase etc.). It also defines certain encodings (methods to represent characters with bytes), UTF-8 being one of them. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Spolsky for more details.
I would certainly go with UTF-8, it is the standard everywhere these days, and has some nice properties such as leaving all 7-bit ASCII characters in place, which means that most HTML-related functions such as htmlspecialchars can be used directly on the UTF-8 representation, so you have less chance of leaving encoding-related security holes. Also, a lot of PHP functions explicitly expect UTF-8 strings, and UTF-8 has better text editor support than alternatives like UTF-16, too.
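That "7-bit ASCII stays in place" property is what makes this safe: the characters htmlspecialchars() escapes (& < > " ') are all ASCII, and UTF-8 never reuses ASCII byte values inside a multibyte sequence, so the function can operate on UTF-8 input byte by byte without corrupting it:

```php
<?php
// Multibyte UTF-8 characters pass through untouched; only the ASCII
// HTML metacharacters are converted to entities.
echo htmlspecialchars('漢字 & <b>"quotes"</b>', ENT_QUOTES, 'UTF-8');
// 漢字 &amp; &lt;b&gt;&quot;quotes&quot;&lt;/b&gt;
```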
It is better to use UTF-8, because it can represent the accented characters of every language in the world. UTF-8 also leaves room for characters that have not yet been assigned. I always prefer and use UTF-8.