Encoding/charset associated to PHP strings


The PHP documentation says:
Of course, in order to be useful, functions that operate on text may have to make some assumptions about how the string is encoded. Unfortunately, there is much variation on this matter throughout PHP’s functions:
[... a few special cases are described ...]
Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data [...]
Source: https://www.php.net/manual/en/language.types.string.php
So naturally my question is: Where are these specifications that allow us to identify the encoding/charset associated to string arguments, return values, constants, array keys/values, ... for built-in functions/methods/data (e.g. array_key_exists, DOMDocument::getElementsByTagName, DateTime::format, $_GET[$key], ini_set, PDO::__construct, json_decode, Exception::getMessage() and many more)? How do composer package providers specify the encodings in which they accept/provide textual data?
I have been working roughly with the following heuristic: (1) never change the encoding of anything, (2) when forced to pick an encoding, pick UTF-8. This has been working for years but it feels very unsatisfactory.
Whenever I try to find an answer to the question, I only get search results relating to url encoding, HTML entities or explaining the interpretation of string literals (with the source file's encoding).

Strings in PHP are what other languages would call byte arrays, i.e. purely a raw sequence of bytes. PHP is not generally interested in what characters those bytes represent, they're just bytes. Only functions that need to work with strings on a character level need to be aware of the encoding, anything else doesn't.
For example, array_key_exists doesn't need to know anything about characters to figure out whether a key with the same bytes as the given string exists in an array.
However, mb_strlen for example explicitly tells you how many characters the string consists of, so it needs to interpret the given string in a specific encoding to give you the right number of characters. mb_strlen('漢字', 'latin1') and mb_strlen('漢字', 'utf-8') give very different results. There isn't a unified way how these kinds of functions are made encoding aware*, you will need to consult their manual entries.
* The mb_ functions in particular generally use mb_internal_encoding(), but other sets of functions won't.
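To make the byte/character distinction concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The string is written as explicit UTF-8 bytes so the result does not depend on how the source file is saved.

```php
<?php
// '漢字' written as explicit UTF-8 bytes (3 bytes per character).
$kanji = "\xE6\xBC\xA2\xE5\xAD\x97";

// strlen() counts bytes and knows nothing about characters.
$byteCount = strlen($kanji);

// mb_strlen() interprets the same bytes in whatever encoding it is told to use.
$asUtf8   = mb_strlen($kanji, 'UTF-8');       // two 3-byte characters
$asLatin1 = mb_strlen($kanji, 'ISO-8859-1');  // every byte counts as one character

printf("bytes=%d utf8=%d latin1=%d\n", $byteCount, $asUtf8, $asLatin1);
```

The bytes never change; only the interpretation that mb_strlen is asked to apply does.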
Functions like DateTime::format are looking for specific characters in the format string to replace by date values, e.g. d for the day, m for the month etc. You can generally assume that these are ASCII byte values it's looking for, unless specified otherwise (and I'm not aware of anything that specifies otherwise). So any ASCII compatible encoding will usually do.
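As a small illustration of this point, the sketch below passes a format string containing non-ASCII literal text. It assumes the source file is saved as UTF-8; the date and the Japanese labels are only example values.

```php
<?php
// A fixed date in UTC so the result does not depend on the server's timezone.
$date = new DateTimeImmutable('2020-01-02 00:00:00', new DateTimeZone('UTC'));

// Only the ASCII letters Y, m and d are format characters. The UTF-8 bytes of
// the literal text are not format characters and are copied through unchanged.
$formatted = $date->format('Y年m月d日');

echo $formatted, "\n";
```

This works because none of the bytes in a multi-byte UTF-8 sequence collide with the ASCII format characters.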
For a lot more details, you may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Often this can be found in the official documentation, e.g., the DOMDocument class has a property encoding (determined by XML declaration). As for methods that return strings, I recommend reading this

Related

Safely concatenate multibyte strings

I'm looking increasingly into ensuring that PHP apps are multibyte-safe, which mostly involves replacing string manipulation functions with their equivalent mb_* functions.
However string concatenation is giving me pause for thought.
Some character encodings (such as UTF-16 unicode) can include a Byte Order Mark at the beginning. If you concatenated two UTF16 strings it's possible you'd introduce a BOM into the resulting string at a location other than the beginning. I suspect that there are other encodings that can also include "header" information such that stitching two strings of the same encoding together would also be problematic. Is PHP smart enough to discard BOMs etc when doing multibyte string concatenations? I suspect not because PHP has traditionally only treated strings as a sequence of bytes. Is there a multibyte-safe equivalent to concatenation? I've not been able to find anything in the mbstring documentation.
Obviously it would never be safe to concatenate strings that are in different encodings so I'm not worrying about that for now.

PHP has traditionally only treated strings as a sequence of bytes
It still does. PHP has no concept of a character string, as it exists in other languages. Therefore, all strings are always byte strings and you will need to manually track which of them are binary strings, which are character strings and which encoding is being used. An effort to bring Unicode strings to PHP resulted in PHP 6, which was abandoned and never released. But then again, even languages which do have native character strings will not automatically do what you're asking for anyway.
Have a look at the Unicode FAQ about BOM, some of the information below directly comes from there.
If a byte order mark ends up in the middle of a string, Unicode dictates that it should be interpreted as ZERO WIDTH NON-BREAKING SPACE. I conclude that this shouldn't usually be an issue, so it's not so terrible to just ignore BOMs.
However, if that bothers you, my recommendation is as follows:
Try to avoid BOMs altogether and mark the data stream accordingly. For example, when using HTTP, set the encoding to UTF-16BE or UTF-16LE using headers.
Sanitize all inputs used by the application (user input, files loaded, ...) as early as possible, by removing these BOMs and converting the encoding. You may even want to use the Normalizer class. Use your favorite framework's features if available.
Use one and only one encoding internally. Use mb_internal_encoding() to set a default for all mb_*() functions.
When outputting strings, add any desired BOMs back to the strings if you must. Again, it's preferable to just mark the data stream correctly.
That said, note that concatenating multi-byte strings can cause a multitude of unexpected situations, the BOM in the middle of a string is just one of them. Issues may also arise when using bidirectional text where RTL or LTR code points in the first string being concatenated may affect text in the second string. Further, many issues may also arise when using other string operations, for example using mb_substr() on bidirectional text may also produce unexpected results. Text involving combining diacritical marks may also be problematic.
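The "sanitize early, use one encoding internally" advice could look roughly like the sketch below. The helper name utf16ToUtf8 is hypothetical, the mbstring extension is assumed, and input without a BOM is assumed to be big-endian, which is a choice made for this example rather than something the answer prescribes.

```php
<?php
// Hypothetical helper: strip a UTF-16 BOM (if any) and convert to UTF-8,
// so that only BOM-free UTF-8 strings are ever concatenated internally.
function utf16ToUtf8(string $bytes): string
{
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    // No BOM: assume big-endian (an assumption of this sketch).
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}

$first  = utf16ToUtf8("\xFF\xFEa\x00"); // little-endian "a" with BOM
$second = utf16ToUtf8("\xFE\xFF\x00b"); // big-endian "b" with BOM

// Plain concatenation is now safe with respect to BOMs.
$joined = $first . $second;
echo $joined, "\n";
```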

How come that two identically encoded words look different in htmlentities?

I have a question concerning UTF-8 and htmlentities. I have two variables with greek text, both of them seem to be UTF-8 encoded (according to mb_detect_encoding()). When I output the two variables, they look exactly the same in the browser (also in the source code).
I was astonished when I realized, that a simple if($var1 == $var2) always failed although they seemed to be exactly the same. So I used htmlentities to see whether the html code would be the same. I was surprised when I saw that the first variable looked like this: Ï�ÎºÏ�Î»Î¿Ï� and the other one like this: ια&ro;. How can it be that two identical words with the same encoding (UTF-8) are nevertheless different? And how could I fix this problem?

Your first question was: How can it be that two identical words with the same encoding (UTF-8) are nevertheless different?
In this case, the encoding isn't really UTF-8 in both cases. The first variable is in "real" UTF-8, while in the second, greek characters are not really in UTF-8, but in ASCII, with non-ASCII characters (greek) encoded using something called a CER (Character Entity Reference).
A web browser and some overly friendly "WYSIWYG" editors will render these strings as identical, but the binary representations of the actual strings (which is what the computer compares) are different. This is why the equality test fails, even if the strings appear to be the same upon human visual inspection in a browser or editor.
I don't think you can rely on mb_detect_encoding to detect encoding in such cases, since there is no way of telling utf-8 apart from ASCII using CER to represent non-ASCII.
Your second question was: How could I fix this problem?
Before you can compare strings that may be encoded differently, you need to convert them to canonical form ( Wikipedia: Canonicalization ) so that their binary representation is identical.
Here is how I've solved it: I've implemented a handy function named utf8_normalize that converts just about any common character representation (in my case: CER, NER, iso-8859-1 and CP-1252) to canonical utf-8 before comparing strings. What you throw in there must to some extent be determined by what are "popular" character representations in the type of environment your software will operate, but if you just make sure that your strings are on canonical form before comparing, it will work.
As noted in the comment below from the OP (phpheini), there also exists the PHP Normalizer class, which may do a better job of normalization than a home-grown function.
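The utf8_normalize function mentioned above is not shown in the answer. As a rough stand-in, the following sketch decodes entity references with html_entity_decode and, if the intl extension happens to be available, applies Unicode normalization. The name toCanonicalUtf8 is hypothetical, and the sketch assumes the input is already either UTF-8 or ASCII with entity references.

```php
<?php
// Hypothetical helper: bring a string to one canonical UTF-8 form before comparing.
function toCanonicalUtf8(string $text): string
{
    // Turn named and numeric entity references into real UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Optionally apply Unicode normalization form C (requires the intl extension).
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($decoded, Normalizer::FORM_C);
        if ($normalized !== false) {
            $decoded = $normalized;
        }
    }
    return $decoded;
}

$fromEntities = toCanonicalUtf8('&kappa;&alpha;');      // ASCII with entity references
$fromUtf8     = toCanonicalUtf8("\xCE\xBA\xCE\xB1");    // the same letters as UTF-8 bytes

var_dump($fromEntities === $fromUtf8);
```

Legacy encodings such as iso-8859-1 or CP-1252 would need an additional conversion step before this one, which the sketch leaves out.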

PHP utf-8 best practices and risks for distributed web applications

I have read several things about this topic but still I have doubts I want to share with the community.
I want to add complete UTF-8 support to the application I developed, DaDaBIK; the application can be used with different DBMSs (such as MySQL, PostgreSQL, SQLite). The charset used in the databases can be ANY. I can't set or assume the charset.
My approach would be to convert, using iconv functions, everything I read from the DB to UTF-8 and then convert it back to the original charset when I have to write to the DB. This would allow me to assume I'm working with UTF-8.
The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even when using mbstring, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can cause problems with UTF-8 and DON'T have an mbstring counterpart, for example the PREG extension, strcspn, trim, ucfirst, ucwords...
Since I'm using some external libraries such as adodb and htmLawed, I can't control all the source code, and those libraries use these functions in several places. Do you have any advice? And above all, how do very popular applications like WordPress handle this (IMHO big) problem? I doubt they have no "trim" in their code. Do they just take the risk (data corruption, for example), or is there something I can't see?
Thanks a lot.

First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.
It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
preg_ supports UTF-8 just fine using the /u modifier.
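The following sketch illustrates the three claims above (byte-safe trim, byte offsets in substr, and the /u modifier), assuming the mbstring extension is loaded. The strings are written as explicit UTF-8 bytes; "\xC3\xA9" is the letter é.

```php
<?php
$word = "h\xC3\xA9llo"; // "héllo" in UTF-8

// trim() only removes ASCII whitespace bytes, so UTF-8 text survives intact.
$trimmed = trim("  " . $word . "  ");

// substr() works on byte offsets and can cut a character in half...
$byteCut = substr($word, 0, 2);
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');

// ...whereas mb_substr() works on character offsets.
$charCut = mb_substr($word, 0, 2, 'UTF-8');

// Without /u the dot matches one byte; with /u it matches one code point.
$withoutU = preg_match('/^.$/', "\xC3\xA9");
$withU    = preg_match('/^.$/u', "\xC3\xA9");

var_dump($trimmed === $word, $byteCutIsValid, $charCut, $withoutU, $withU);
```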
If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for some more in-depth discussion and demystification about PHP and character sets.
Further, it does not matter what the strings are encoded as in the database. You can set the connection encoding for the database, which will cause it to convert everything for you and always return you data in the desired client encoding. No need for iconverting everything in PHP.
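For MySQL accessed through PDO, the connection encoding can be requested via the charset option of the DSN. The sketch below only builds such a DSN; the helper name mysqlUtf8Dsn, the host, the database name, and the credentials are placeholders, and other DBMSs use different mechanisms that are not shown here.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that asks the server to talk
// UTF-8 (utf8mb4) on this connection, regardless of how tables are stored.
function mysqlUtf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

$dsn = mysqlUtf8Dsn('localhost', 'app');
echo $dsn, "\n";

// Connecting requires a reachable server and real credentials, so it is left
// commented out in this sketch:
// $pdo = new PDO($dsn, 'user', 'secret');
```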

Declaration to make PHP script completely Unicode-friendly

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).

That full-Unicode thing was precisely the idea of PHP 6, which was canceled more than a year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):
mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled.
Note that this feature was deprecated in PHP 7.2 and removed in PHP 8.0.

All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
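A short sketch of the Content-Length case, assuming the mbstring extension is loaded. The payload is a stand-in for any body containing multi-byte data.

```php
<?php
// Six bytes of payload that happen to be two UTF-8 characters.
$body = "\xE6\xBC\xA2\xE5\xAD\x97";

// Content-Length is defined in bytes, so strlen() is the right tool here.
$correctHeader = 'Content-Length: ' . strlen($body);

// Counting characters instead would announce too short a body.
$wrongHeader = 'Content-Length: ' . mb_strlen($body, 'UTF-8');

echo $correctHeader, "\n", $wrongHeader, "\n";
```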
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)

Strange behaviour of mb_detect_order() in PHP

I would like to detect encoding of some text (using PHP).
For that purpose I use the mb_detect_encoding() function.
The problem is that the function returns different results if I change the order of possible encodings with the mb_detect_order() function.
Consider the following example
$html = <<< STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'UTF-8'
However if you change the order of encodings in mb_detect_order() the results will be different:
mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'EUC-JP'
So my questions are:
Why is that happening ?
Is there a way in PHP to correctly and unambiguously detect encoding of text ?

That's what I would expect to happen.
The detection algorithm probably just keeps trying, in order, the encodings you specified in mb_detect_order and then returns the first one under which the bytestream would be valid.
Something more intelligent requires statistical methods (I think machine learning is commonly used).
EDIT: See e.g. this article for more intelligent methods.
Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).

Not really. The different encodings often have large areas of overlap, and if the string you are testing lies entirely inside that overlap, then both encodings are acceptable.
For example, utf-8 and ISO-8859-1 are the same for the letters a-z. The string "hello" would have an identical sequence of bytes in both encodings.
This is exactly why there is an mb_detect_order() function in the first place, as it allows you to say what you would prefer to happen when these clashes happen. Would you like "hello" to be utf-8 or ISO-8859-1?
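The overlap can be made visible by asking, for each candidate, whether the bytes are valid in that encoding. This is a sketch assuming the mbstring extension and PHP 7.4 or later; the helper name validEncodings is hypothetical.

```php
<?php
// Hypothetical helper: list every candidate encoding under which the bytes are valid.
function validEncodings(string $bytes, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        fn (string $encoding): bool => mb_check_encoding($bytes, $encoding)
    ));
}

$candidates = ['UTF-8', 'ISO-8859-1'];

// Pure ASCII is valid in both, so the bytes alone cannot settle the question.
$forAscii = validEncodings('hello', $candidates);

// A lone 0xE9 byte ("é" in ISO-8859-1) is not valid UTF-8, which rules UTF-8 out.
$forLatin1 = validEncodings("caf\xE9", $candidates);

print_r($forAscii);
print_r($forLatin1);
```

When more than one candidate remains, something outside the bytes (a preference order, a header, prior knowledge) has to decide.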

Keep in mind mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. Going by that, it needs to guess what the encoding is - e.g. ASCII would be if bytes are only in the 0-127 range, UTF-8 would be if there are ASCII bytes and 128+ bytes that exist only in pairs or more, and so forth.
As you can imagine, given that context, it's quite difficult to detect an encoding reliably.
Like rihk said, this is what the mb_detect_order() function is for - you're basically supplying your best guess what the data is likely to be. Do you work with UTF-8 files frequently? Then chances are your stuff isn't likely to be UTF-16 even if mb_detect_encoding() could guess it as that.
You might also want to check out Artefacto's link for a more in-depth view.
Example case: Internet Explorer uses some interesting encoding guessing if nothing is specified (#link, Section: 'To automatically detect a website's language') that's caused strange behaviours on websites that took encoding for granted in the past. You can probably find some amusing stuff on that if you google around. It makes for a nice show-case how even statistical methods can backfire horribly, and why encoding-guessing in general is problematic.

mb_detect_encoding looks at the first charset entry in your mb_detect_order() and then loops through your input $html, checking character by character whether that character falls within the valid set of characters for the charset. If every character matches, it returns that charset; if any character fails, it moves on to the next charset in the mb_detect_order() and tries again.
The wikipedia list of charsets is a good place to see the characters that make up each charset.
Because these charset values overlap (char x8fA1EF exists in both 'UTF-8' and in 'EUC-JP') this will be considered a match even though it's a totally different character in each character set. So unless any of the character values exist in one charset, but not in another, then mb_detect_encoding can't identify which of the charsets is invalid; and will return the first charset from your array list which could be valid.
As far as I'm aware, there is no surefire way of identifying a charset. PHP's "best guess" method can be helped if you have a reasonable idea of what charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid characters) in each charset.
The best solution is to "know" the charset. If you are scraping your html from another page, look for the charset identifier in the header of that page.
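When the page was fetched over HTTP, one place to look is the Content-Type response header. The sketch below pulls the charset parameter out of such a header value; the function name charsetFromContentType is hypothetical, and the regular expression is a simplification that does not cover every legal header syntax.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns null when no charset is declared.
function charsetFromContentType(string $headerValue): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $headerValue, $matches) === 1) {
        return strtoupper($matches[1]);
    }
    return null;
}

$declared = charsetFromContentType('text/html; charset=euc-jp');
$missing  = charsetFromContentType('text/html');

var_dump($declared, $missing);
```

A declared charset can then be passed as the source encoding to a conversion function instead of guessing.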
If you really want to be clever, you can try and identify the language in which the html is written, perhaps using trigrams or n-grams or similar as described in this article on PHP/ir.
