Ensuring valid UTF-8 in PHP

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?
For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?
function make_safe_for_utf8_use($string) {
    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.
Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
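To make the difference concrete, here is a minimal sketch that converts the same byte from both encodings. It assumes the mbstring extension is available; the byte 0x93 is just an illustrative pick from the 0x80-0x9F range.

```php
<?php
// Byte 0x93: a left curly quote in Windows-1252, a C1 control character in ISO-8859-1.
$byte = "\x93";

$fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromCp1252), PHP_EOL; // e2809c (U+201C, left double quotation mark)
echo bin2hex($fromLatin1), PHP_EOL; // c293   (U+0093, control character)
```

Both results are valid UTF-8; only the Windows-1252 reading yields something a reader would recognise.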
"would this code ensure that a string is safe to insert into a UTF-8 encoded document"
You would certainly want to set the optional ‘strict’ parameter of mb_detect_encoding to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know if that can still happen in strict mode.
If you want to be sure, do it yourself using the W3-recommended regex:
if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);

With the mbstring library, you have mb_check_encoding().
Example of use:
mb_check_encoding($string, 'UTF-8');
However, with PHP 7.1.9 on a recent Windows 10 system, the regex solution now outperforms mb_check_encoding() for any string length (tested on 20,000 iterations):
10 characters: regex => 4 ms, mb_check_encoding() => 64 ms
10000 chars: regex => 125 ms, mb_check_encoding() => 2.4 s
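Putting the validity check and the Windows-1252 fallback together could look like the sketch below. The function name ensure_utf8 and its $fallback parameter are hypothetical, and mb_check_encoding() could be swapped for the regex if the timings above matter for your workload.

```php
<?php
// Hypothetical helper: return $string unchanged if it is already valid UTF-8,
// otherwise treat it as Windows-1252 (or another fallback) and convert it.
function ensure_utf8(string $string, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

echo bin2hex(ensure_utf8("caf\xC3\xA9")), PHP_EOL; // already UTF-8, returned unchanged
echo bin2hex(ensure_utf8("caf\xE9")), PHP_EOL;     // single-byte input, converted
```

Because every byte sequence is "valid" in a single-byte encoding, the fallback branch always produces some result; whether it is the intended text depends on the guess being right.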

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:
<?php
if (preg_match("//u", $string)) {
    // $string is valid UTF-8
}
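A slightly fuller sketch of the same idea follows. The wrapper name is_valid_utf8 is hypothetical; the point it illustrates is that preg_match() returns false, not 0, when the subject is not valid UTF-8, so a strict comparison is the safer check.

```php
<?php
// Hypothetical wrapper around the empty-pattern /u check.
function is_valid_utf8(string $string): bool
{
    // The empty pattern matches any valid subject once, so 1 means "valid".
    // On malformed UTF-8, preg_match() returns false instead.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // well-formed two-byte sequence
var_dump(is_valid_utf8("caf\xE9"));     // lone Latin-1 byte

// After a failed match, preg_last_error() reports the reason.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```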

In reply to the claim that "iconv is idempotent": it is not.
A big difference between utf8_encode() and iconv() is that iconv may raise errors such as "Detected an incomplete multibyte character in input string", even with:
iconv('ISO-8859-1', 'UTF-8'.'//IGNORE', $str)
As for this line in the code above:
$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
You have to know how mb_detect_encoding behaves: it can answer UTF-8 even for badly formed (invalid) UTF-8 strings.

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about character sets. This page links to a page specifically for UTF-8.

Related

How to convert malformed database characters (ascii to utf-8)

I know many people will say this has already been answered, as in https://stackoverflow.com/a/4983999/1833322, but let me explain why it's not that straightforward.
I would like to use PHP to convert something "that looks like ASCII" into "UTF-8".
There is a website which does this: https://onlineutf8tools.com/convert-ascii-to-utf8
When I input the string Z…Z I get back Z⬦Z, which is the correct output.
I tried iconv and some mb_ functions, but I can't figure out whether these functions are capable of doing what I want or which options I need. If it's not possible with these functions, some self-written PHP code would be appreciated. (The website runs JavaScript and I don't think PHP is less capable in this regard.)
To be clear: the goal is to recreate in PHP what that website is doing, not to have a semantic debate about ASCII and UTF-8.
EDIT: the website uses https://github.com/mathiasbynens/utf8.js which says
it can encode/decode any scalar Unicode code point values, as per the Encoding Standard.
Standard linking to https://encoding.spec.whatwg.org/#utf-8 So this library says it implements the standard, then what about PHP ?
UTF-8 is a superset of ASCII so converting from ASCII to UTF-8 is like converting a car into a vehicle.
+--- UTF-8 ---------------+
| |
| +--- ASCII ---+ |
| | | |
| +-------------+ |
+-------------------------+
The tool you link seems to be using the term "ASCII" as synonym for mojibake (it says "car" but means "scrap metal"). Mojibake typically happens this way:
You pick a non-English character: ⬦ 'WHITE MEDIUM DIAMOND' (U+2B26)
You encode it using UTF-8: 0xE2 0xAC 0xA6
You open the stream in a tool that's configured to use the single-byte encoding that's widely used in your area: Windows-1252
You look up the individual bytes of the UTF-8 character in the character table of the single-byte encoding:
0xE2 -> â
0xAC -> ¬
0xA6 -> ¦
You encode the resulting characters in UTF-8:
â = 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2) = 0xC3 0xA2
¬ = 'NOT SIGN' (U+00AC) = 0xC2 0xAC
¦ = 'BROKEN BAR' (U+00A6) = 0xC2 0xA6
Thus you've transformed the UTF-8 stream 0xE2 0xAC 0xA6 (⬦) into the also UTF-8 stream 0xC3 0xA2 0xC2 0xAC 0xC2 0xA6 (â¬¦).
To undo this you need to reverse the steps. That's straightforward if you know what proxy encoding was used (Windows-1252 in my example):
$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6";
$proxy = 'Windows-1252';
var_dump($mojibake, bin2hex($mojibake));
$original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
var_dump($original, bin2hex($original));
string(6) "â¬¦"
string(12) "c3a2c2acc2a6"
string(3) "⬦"
string(6) "e2aca6"
But it's tricky if you don't. I guess you can:
1. Compile a dictionary of the different byte sequences you get in the different single-byte encodings and then use some kind of Bayesian inference to figure out the most likely encoding. (I can't really help you with this.)
2. Try the most likely encodings and visually inspect the output to determine which is correct:
    // Source code saved as UTF-8
    $mojibake = "Z…Z";
    foreach (mb_list_encodings() as $proxy) {
        $original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
        echo $proxy, ': ', $original, PHP_EOL;
    }
3. If (as in your case) you know what the original text is and you're kind of sure that you don't have mixed encodings, do as #2 but trying all the encodings PHP supports:
    // Source code saved as UTF-8
    $mojibake = 'Z…Z';
    $expected = 'Z⬦Z';
    foreach (mb_list_encodings() as $proxy) {
        // @ suppresses warnings from encodings that cannot handle the input
        $current = @mb_convert_encoding($mojibake, $proxy, 'UTF-8');
        if ($current === $expected) {
            echo "$proxy: match\n";
        }
    }
(This prints wchar: match; not really sure what that means.)
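When the proxy encoding is known, the reversal described above can be wrapped in a guarded helper. This is only a sketch: undo_mojibake is a hypothetical name, Windows-1252 is an assumed default, and the two checks (lossless round trip, valid UTF-8 result) are heuristics that reduce, but cannot rule out, false repairs.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 bytes read as $proxy, then
// re-encoded as UTF-8". Returns the input unchanged if the repair looks wrong.
function undo_mojibake(string $text, string $proxy = 'Windows-1252'): string
{
    // Map each character back to the single byte it has in the proxy encoding.
    $candidate = mb_convert_encoding($text, $proxy, 'UTF-8');

    // If converting forward again does not reproduce the input, some character
    // had no equivalent in the proxy encoding, so the text was not mojibake of this kind.
    if (mb_convert_encoding($candidate, 'UTF-8', $proxy) !== $text) {
        return $text;
    }

    // The recovered bytes must themselves be well-formed UTF-8.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $text;
}

echo bin2hex(undo_mojibake("\xC3\xA2\xC2\xAC\xC2\xA6")), PHP_EOL; // e2aca6
```

Correctly encoded text such as "café" comes back untouched, because the candidate bytes would not be valid UTF-8.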

php preg_replace: unicode modifier for ascii strings

I need to handle strings in my PHP script using regular expressions. But there is a problem: different strings have different encodings. If a string contains just ASCII symbols, the mb_detect_encoding function returns 'ASCII'. But if the string contains Russian symbols, for example, mb_detect_encoding returns 'UTF-8'. It's not a good idea to check the encoding of each string manually, I suppose.
So the question is: is it correct to use preg_replace (with the Unicode modifier) for ASCII strings? Is it right to write code such as preg_replace ("/[^_a-z]/u","",$string); for both ASCII and UTF-8 strings?
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it typically uses ISO-8859-1, which is NOT ASCII (it's a superset of ASCII in that the first 128 characters are the same). Some characters, for example the Swedish ones å, ä and ö, can be represented in both ISO-8859-1 and Unicode. They have the same code points but different byte representations: one byte in ISO-8859-1, two bytes in UTF-8. For preg_* functions this matters mainly when the u modifier is used, since the subject must then be valid UTF-8, so please keep it in mind when working with different encodings.
You should really, really try to know which character set your strings are in, without the magic of mb_detect_encoding (mb_detect_encoding is not a guarantee, just a good guess). For example, strings fetched through HTTP usually have a character set specified in the Content-Type header.
Yes, for ASCII strings you can always use the Unicode modifier; it will affect neither results nor performance.
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.
However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8 the characters with the leftmost bit set on (values x80 - xff) are not encoded the same in UTF-8 and it would not be appropriate to use the PREG "u" modifier.
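A short sketch of both cases, using the pattern from the question. The sample strings are illustrative only; the second one is "café" in ISO-8859-1, written with an escaped byte.

```php
<?php
$pattern = '/[^_a-z]/u';

// Pure ASCII: the bytes are identical in UTF-8, so the u modifier is harmless.
$asciiResult = preg_replace($pattern, '', 'Hello_World');
var_dump($asciiResult); // string(9) "ello_orld"

// ISO-8859-1: the byte 0xE9 is not valid UTF-8, so the call fails and returns NULL.
$latin1Result = preg_replace($pattern, '', "caf\xE9");
var_dump($latin1Result); // NULL
```

The NULL return is easy to miss, because it silently becomes an empty string if the result is used without checking.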

Is testing for UTF-8 strings in PHP a reliable method?

I've found a useful function on another answer and I wonder if someone could explain to me what it is doing and if it is reliable. I was using mb_detect_encoding(), but it was incorrect when reading from an ISO 8859-1 file on a Linux OS.
This function seems to work in all cases I tested.
Here is the question: Get file encoding
Here is the function:
function isUTF8($string){
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}
Is this a reliable way of detecting UTF-8 strings?
What exactly is it doing? Can it be made more robust?
If you do not know the encoding of a string, it is impossible to guess the encoding with any degree of accuracy. That's why mb_detect_encoding simply does not work. If however you know what encoding a string should be in, you can check if it is a valid string in that encoding using mb_check_encoding. It more or less does what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. That doesn't necessarily mean the string actually is encoded in that encoding, just that it may be. For example, it'll be impossible to distinguish any single-byte encoding using all 8 bits from any other single-byte encoding using 8 bits. But UTF-8 should be rather distinguishable, though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
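The ambiguity is easy to reproduce. In this sketch (mbstring assumed), the same two bytes pass the UTF-8 validity check and are also a perfectly legal ISO-8859-1 string; nothing in the bytes says which reading was intended.

```php
<?php
// The same two bytes, read two ways.
$bytes = "\xC3\xA9";

// Valid UTF-8: one character, U+00E9 (é).
$isValidUtf8 = mb_check_encoding($bytes, 'UTF-8');
var_dump($isValidUtf8); // bool(true)

// Also valid ISO-8859-1: two characters, U+00C3 and U+00A9 ("Ã©").
$readAsLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
echo bin2hex($readAsLatin1), PHP_EOL; // c383c2a9
```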
Well, it only checks if the string has byte sequences that happen to correspond to valid UTF-8 code points. However, it won't flag the sequence 0x00-0x7F which is the ASCII compatible subset of UTF-8.
EDIT: Incidentally, I am guessing the reason you thought mb_detect_encoding() "didn't work properly" was that your Latin-1 encoded file only used the ASCII-compatible subset, which is also valid UTF-8. It's no wonder that mb_detect_encoding() would flag that as UTF-8, and it is "correct": if the data is just plain ASCII, then the answer UTF-8 is as good as Latin-1, or ASCII, or any of the myriad extended ASCII encodings.
That will just detect if part of the string is a formally valid UTF-8 sequence, ignoring one code unit encoded characters (representing code points in ASCII). For that function to return true it suffices that there's one character that looks like a non-ASCII UTF-8 encoded character.
Basically, no.
Any UTF8 string is a valid 8-bit encoding string (even if it produces gibberish).
On the other hand, most 8-bit encoded strings with extended (128+) characters are not valid UTF8, but, as any other random byte sequence, they might happen to be.
And, of course, any ASCII text is valid UTF8, so mb_detect_encoding is, in fact, correct in saying so. And no, you won't have any problems using ASCII text as UTF8. It's the reason UTF8 works in the first place.
As far as I understand, the function you supplied does not check the validity of the string, just that it contains some sequences that happen to look like UTF8 ones, so this function might misfire much worse. You may want to use both this function and mb_detect_encoding in strict mode and hope that they cancel out each other's false positives.
If the text is written in a non-latin alphabet, a "smart" way to detect a multibyte encoding is to look for sequences of equally sized chunks of bytes starting with the same bits. For example, Russian word "привет" looks like this:
11010000 10111111
11010001 10000000
11010000 10111000
11010000 10110010
11010000 10110101
11010001 10000010
This, however, won't work for latin-based alphabets (and, probably, Chinese).
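The bit patterns above can be reproduced with a few lines of PHP. This is a sketch only: chunk_bits is a hypothetical helper, the fixed chunk size of 2 is an assumption that happens to fit Cyrillic text, and the word is written as escaped bytes so the example does not depend on how the source file is saved.

```php
<?php
// "привет" as UTF-8 bytes.
$word = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// Hypothetical helper: split the bytes into fixed-size chunks and render
// every byte of each chunk as 8 binary digits.
function chunk_bits(string $bytes, int $chunkSize = 2): array
{
    $lines = [];
    foreach (str_split($bytes, $chunkSize) as $chunk) {
        $bits = array_map(
            fn (string $byte): string => sprintf('%08b', ord($byte)),
            str_split($chunk)
        );
        $lines[] = implode(' ', $bits);
    }
    return $lines;
}

echo implode(PHP_EOL, chunk_bits($word)), PHP_EOL;
```

Every line starts with 110 and its second byte starts with 10, which is the regularity the heuristic looks for.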
The function in question (the one that the user pilif posted in the linked question) appears to have been taken from a comment on the mb_detect_encoding() page in the PHP Manual.
As the author states, the function is only meant to "check if a string contains UTF-8 characters" and it only looks for "non-ascii multibyte sequences in the UTF-8 range". Therefore, the function returns false (zero actually) if your string just contains simple ascii characters (like english text), which is probably not what you want.
His function was based on another function in this previous comment on that same page which is, in fact, meant to check if a string is UTF-8 and was based on this regular expression created by someone at W3C.
Here is the original, correctly working (I've tested) function that will tell you if a string is UTF-8:
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
This may not be the answer to your question (maybe it is, see the update below), but it could be the answer to your problem. Check out my Encoding class that has methods to convert strings to UTF8, no matter if they are encoded in Latin1, Win1252, or UTF8 already, or a mix of them:
Encoding::toUTF8($text_or_array);
Encoding::toWin1252($text_or_array);
Encoding::toISO8859($text_or_array);
// fixes UTF8 strings converted to UTF8 repeatedly:
// "FÃÂédÃÂération" to "Fédération"
Encoding::fixUTF8($text_or_array);
https://stackoverflow.com/a/3479832/290221
The function runs byte by byte and figures out whether each one of them needs conversion or not.
Update:
Thinking a little bit more about it, this could in fact be the answer to your question:
require_once('Encoding.php');
function validUTF8($string){
    return Encoding::toUTF8($string) == $string;
}
And here is the Encoding class:
https://github.com/neitanod/forceutf8

php unicode 16 bit

how can I append a 16 bit unicode character to a string in php
$test = "testing" . (U + 199F);
From what I see, \x only takes 8 bit characters aka ascii
From the manual:
PHP only supports a 256-character set, and hence does not offer native Unicode support.
You could enter a manually-encoded UTF-8 sequence, I suppose.
You can also type out UCS4 as byte sequence and use iconv("UTF-32LE", "UTF-8", $str); to convert it into UTF-8 for further processing. You just can't input the codepoint as a 32-bit code unit in one go.
Unicode characters don't directly exist in PHP(*), but you can deal with strings containing bytes that represent characters in UTF-8 encoding. Here's one way of converting a numeric character code point to UTF-8:
function unichr($i) {
    // Pack the code point as a 32-bit little-endian integer, then convert it to UTF-8
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
$test = 'testing' . unichr(0x199F);
(*: and ‘16-bit’ Unicode characters don't exist at all; Unicode has code points way beyond U+FFFF. There are 16-bit ‘code units’ in UTF-16, but that's an ugly encoding you're unlikely to meet in PHP.)
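The same conversion can be done by hand with bit operations, which avoids the iconv dependency. The following is a minimal sketch; codepoint_to_utf8 is a hypothetical name, not a built-in. On PHP 7.2 and later with mbstring, mb_chr() should give the same result.

```php
<?php
// Hypothetical helper: encode a Unicode code point as a UTF-8 byte string.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    if ($cp < 0x80) {        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$test = 'testing' . codepoint_to_utf8(0x199F);
echo bin2hex(codepoint_to_utf8(0x199F)), PHP_EOL; // e1a69f
```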
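Because a Unicode character in UTF-8 is just a sequence of bytes, and PHP strings are byte strings, you can build a multibyte character from several single-byte escapes :) The bytes must be the UTF-8 encoding of the code point, not the code point's own digits; for U+199F that is E1 A6 9F:
$test = "testing\xE1\xA6\x9F";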
Try (PHP 7.0 and later, which added the \u{...} escape in double-quoted strings):
$test = "testing" . "\u{199F}";

Convert ASCII TO UTF-8 Encoding

How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything because ASCII is already valid UTF-8.
But if you still want to convert, just to be sure that it's UTF-8, then you can use iconv
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert ASCII to UTF-8:
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string built from code points x00 to x7F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such a string is pointless.
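A quick way to confirm this for a given string is to check that every byte is below 0x80. The sketch below uses a hypothetical is_ascii helper and assumes mbstring for the comparison; it shows that converting an ASCII string leaves the bytes untouched.

```php
<?php
// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
// Without the u modifier the pattern is matched byte by byte.
function is_ascii(string $string): bool
{
    return preg_match('/[^\x00-\x7F]/', $string) === 0;
}

$ascii = 'plain text';
$converted = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');

var_dump(is_ascii($ascii));        // bool(true)
var_dump($converted === $ascii);   // bool(true): conversion changed nothing
var_dump(is_ascii("ch\xC3\xA1r")); // bool(false): contains a non-ASCII UTF-8 sequence
```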
Use utf8_encode(). Note that it assumes the input is ISO-8859-1, and that it is deprecated as of PHP 8.2.
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation of what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html
