How to convert malformed database characters (ASCII to UTF-8) in PHP

I know many people will say this has already been answered, like so: https://stackoverflow.com/a/4983999/1833322. But let me explain why it's not quite as straightforward.
I would like to use PHP to convert something "that looks like ascii" into "utf-8"
There is a website which does this https://onlineutf8tools.com/convert-ascii-to-utf8
When I input the string Z…Z I get back Z⬦Z, which is the correct output.
I tried iconv and some mb_* functions, but I can't figure out whether these functions are capable of doing what I want, or which options I need. If it's not possible with these functions, some self-written PHP code would be appreciated. (The website runs JavaScript, and I don't think PHP is less capable in this regard.)
To be clear: the goal is to recreate in PHP what that website is doing, not to have a semantic debate about ASCII and UTF-8.
EDIT: the website uses https://github.com/mathiasbynens/utf8.js which says
it can encode/decode any scalar Unicode code point values, as per the Encoding Standard.
with "Encoding Standard" linking to https://encoding.spec.whatwg.org/#utf-8. So this library says it implements the standard; what about PHP?

UTF-8 is a superset of ASCII so converting from ASCII to UTF-8 is like converting a car into a vehicle.
+--- UTF-8 ---------------+
|                         |
|    +--- ASCII ---+      |
|    |             |      |
|    +-------------+      |
+-------------------------+
The tool you link seems to be using the term "ASCII" as synonym for mojibake (it says "car" but means "scrap metal"). Mojibake typically happens this way:
1. You pick a non-English character: ⬦ 'WHITE MEDIUM DIAMOND' (U+2B26)
2. You encode it using UTF-8: 0xE2 0xAC 0xA6
3. You open the stream in a tool that's configured to use the single-byte encoding that's widely used in your area: Windows-1252
4. You look up the individual bytes of the UTF-8 character in the character table of the single-byte encoding:
   0xE2 -> â
   0xAC -> ¬
   0xA6 -> ¦
5. You encode the resulting characters in UTF-8:
   â = 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2) = 0xC3 0xA2
   ¬ = 'NOT SIGN' (U+00AC) = 0xC2 0xAC
   ¦ = 'BROKEN BAR' (U+00A6) = 0xC2 0xA6
Thus you've transformed the UTF-8 stream 0xE2 0xAC 0xA6 (⬦) into another stream that is also UTF-8, 0xC3 0xA2 0xC2 0xAC 0xC2 0xA6 (â¬¦).
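To make that chain concrete, here is a minimal sketch that reproduces it in PHP (assuming Windows-1252 as the proxy encoding, as in the steps above):
// Sketch: manufacture the mojibake described above.
$original = "\u{2B26}";              // ⬦, stored in UTF-8 as E2 AC A6
// Treat each UTF-8 byte as a Windows-1252 character and re-encode as UTF-8:
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
var_dump(bin2hex($mojibake));        // string(12) "c3a2c2acc2a6"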
To undo this you need to reverse the steps. That's straightforward if you know what proxy encoding was used (Windows-1252 in my example):
$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6";
$proxy = 'Windows-1252';
var_dump($mojibake, bin2hex($mojibake));
$original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
var_dump($original, bin2hex($original));
string(6) "â¬¦"
string(12) "c3a2c2acc2a6"
string(3) "⬦"
string(6) "e2aca6"
But it's tricky if you don't. I guess you can:
1. Compile a dictionary of the different byte sequences you get in the different single-byte encodings and then use some kind of Bayesian inference to figure out the most likely encoding. (I can't really help you with this.)
2. Try the most likely encodings and visually inspect the output to determine which is correct:
// Source code saved as UTF-8
$mojibake = "Z…Z";
foreach (mb_list_encodings() as $proxy) {
    $original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    echo $proxy, ': ', $original, PHP_EOL;
}
If (as in your case) you know what the original text is and you're kind of sure that you don't have mixed encodings, do as #2 but trying all the encodings PHP supports:
// Source code saved as UTF-8
$mojibake = 'Z…Z';
$expected = 'Z⬦Z';
foreach (mb_list_encodings() as $proxy) {
    // @ suppresses warnings from encodings that cannot represent the input
    $current = @mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    if ($current === $expected) {
        echo "$proxy: match\n";
    }
}
(This prints wchar: match; not really sure what that means.)

Related

How to display the (extended) ASCII representation of a special character in PHP 5.6?

I am trying to decode this special character: "ß", if I use "ord()", I get "C3"
echo "ord hex--> " . dechex(ord('ß'));
...but that doesn't look good, so I tried "bin2hex()"; now I get "C39F" (what?).
echo "bin2hex --> " . bin2hex('ß');
By using an extended ASCII table from the Internet, I know that the correct hexadecimal value is "DF", so I then tried "hex2bin()", but that gives me some unknown character like this: "�".
echo "hex2bin --> " . hex2bin('DF');
Is it possible to get the "DF" output?
You're on the right path with bin2hex, what you're confused about is merely the encoding. Currently you're seeing the hex value of ß for the UTF-8 encoding, because your string is encoded in UTF-8. What you want is the hex value for that string in some other encoding. Let's assume "Extended ASCII" refers to ISO-8859-1, as it colloquially often does (but doesn't have to):
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'ß'));
Now, having said that, I have no idea what you'd use that information for. There are many valid "hex values" for the character ß in various different encodings; "Extended ASCII" is just one possible answer, and it's a vague answer to be sure, since "Extended ASCII" has very little practical meaning with hundreds of different "Extended ASCII" charsets available.
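For illustration, here is a small sketch (assuming the source file is saved as UTF-8) that prints the hex bytes of ß in a few common encodings:
foreach (['ISO-8859-1', 'CP1252', 'UTF-8', 'UTF-16BE'] as $encoding) {
    // iconv re-encodes the UTF-8 literal; bin2hex shows the raw bytes
    echo $encoding, ': ', bin2hex(iconv('UTF-8', $encoding, 'ß')), PHP_EOL;
}
// ISO-8859-1: df
// CP1252: df
// UTF-8: c39f
// UTF-16BE: 00df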
ASCII goes from 0x00 to 0x7F. This is not enough to represent all the characters needed so historically old Windows OSes used the available space in a byte (from 0x80 to 0xFF) to represent different characters depending on the localization. This is what codepages are: an arbitrary mapping of non-ASCII values to non-ASCII characters. What you call "extended ASCII" is IMO an inappropriate name for a codepage.
The assumption 1 byte - 1 character is dead and (if not) must die.
So actually what you are seeing is the UTF-8 representation of ß. If you want to see the Unicode code point value of ß (or any other character), just show its UTF-32 representation, which AFAIK maps 1:1 to code points.
// Prints 000000df
echo bin2hex(iconv('UTF-8', 'UTF-32BE', 'ß'));
bin2hex() should be fine, as long as you know what encoding you are using.
The C3 output you get appears to be the first byte of the two-byte representation of the character in UTF-8 (which incidentally means that you've configured your editor to save files in that encoding, a good idea in 2017).
The ord() function does not accept arbitrary encodings, let alone Unicode-compatible ones such as UTF-8:
Returns the ASCII value of the first character of string.
ASCII (a fairly small 7-bit charset) does not have any encoding for the ß character (aka U+00DF LATIN SMALL LETTER SHARP S). Seriously. ASCII does not even have a 0xDF position (its printable characters end at 0x7E).
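As an aside, PHP 7.2+ ships mb_ord(), which (unlike ord()) is encoding-aware and returns the Unicode code point directly; a minimal sketch:
// Prints "df": the code point of ß, regardless of its UTF-8 byte layout
echo dechex(mb_ord('ß', 'UTF-8'));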

Is testing for UTF-8 strings in PHP a reliable method?

I've found a useful function on another answer and I wonder if someone could explain to me what it is doing and if it is reliable. I was using mb_detect_encoding(), but it was incorrect when reading from an ISO 8859-1 file on a Linux OS.
This function seems to work in all cases I tested.
Here is the question: Get file encoding
Here is the function:
function isUTF8($string){
    return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # Non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}
Is this a reliable way of detecting UTF-8 strings?
What exactly is it doing? Can it be made more robust?
If you do not know the encoding of a string, it is impossible to guess the encoding with any degree of accuracy. That's why mb_detect_encoding simply does not work. If, however, you know what encoding a string should be in, you can check whether it is a valid string in that encoding using mb_check_encoding. It more or less does what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no.
That doesn't necessarily mean the string actually is encoded in that encoding, just that it may be. For example, it's impossible to distinguish any single-byte encoding using all 8 bits from any other single-byte encoding using all 8 bits. But UTF-8 should be rather distinguishable, though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
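A minimal sketch of that validate-then-trust approach (the input source and the process() helper here are hypothetical):
$input = $_POST['text'] ?? '';       // hypothetical input source
if (mb_check_encoding($input, 'UTF-8')) {
    process($input);                 // hypothetical: safe to treat as UTF-8
} else {
    // Reject it, or convert from a known fallback encoding:
    $input = mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}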
Well, it only checks if the string has byte sequences that happen to correspond to valid UTF-8 code points. However, it won't flag the sequence 0x00-0x7F which is the ASCII compatible subset of UTF-8.
EDIT: Incidentally, I am guessing the reason you thought mb_detect_encoding() "didn't work properly" was that your Latin-1 encoded file only used the ASCII compatible subset, which is also valid in UTF-8. It's no wonder that mb_detect_encoding() would flag that as UTF-8, and it is "correct": if the data is just plain ASCII, then the answer UTF-8 is as good as Latin-1, or ASCII, or any of the myriad extended ASCII encodings.
That will just detect whether part of the string is a formally valid UTF-8 sequence, ignoring characters encoded as a single code unit (i.e., code points in the ASCII range). For that function to return true, it suffices that one character looks like a non-ASCII UTF-8-encoded character.
Basically, no.
Any UTF8 string is a valid 8-bit encoding string (even if it produces gibberish).
On the other hand, most 8-bit encoded strings with extended (128+) characters are not valid UTF8, but, as any other random byte sequence, they might happen to be.
And, of course, any ASCII text is valid UTF-8, so mb_detect_encoding is, in fact, correct in saying so. And no, you won't have any problems using ASCII text as UTF-8. It's the reason UTF-8 works in the first place.
As far as I understand, the function you supplied does not check the validity of the string, just that it contains some sequences that happen to look like UTF-8, so it might misfire much worse. You may want to use both this function and mb_detect_encoding in strict mode and hope that they cancel out each other's false positives, as sketched below.
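A sketch of that combined check, reusing the isUTF8() function from above (the helper name is hypothetical):
function looksLikeUTF8($string) {
    // Both the regex and mb_detect_encoding in strict mode must agree.
    return isUTF8($string)
        && mb_detect_encoding($string, 'UTF-8', true) !== false;
}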
If the text is written in a non-latin alphabet, a "smart" way to detect a multibyte encoding is to look for sequences of equally sized chunks of bytes starting with the same bits. For example, Russian word "привет" looks like this:
11010000 10111111
11010001 10000000
11010000 10111000
11010000 10110010
11010000 10110101
11010001 10000010
This, however, won't work for latin-based alphabets (and, probably, Chinese).
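A quick way to eyeball that pattern yourself (a sketch; str_split() splits a string into single bytes, so each line shows one byte in binary):
foreach (str_split('привет') as $byte) {
    echo sprintf('%08b', ord($byte)), PHP_EOL;
}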
The function in question (the one that the user pilif posted in the linked question) appears to have been taken from a comment on the mb_detect_encoding() page in the PHP Manual.
As the author states, the function is only meant to "check if a string contains UTF-8 characters" and it only looks for "non-ascii multibyte sequences in the UTF-8 range". Therefore, the function returns false (zero, actually) if your string just contains plain ASCII characters (like English text), which is probably not what you want.
His function was based on another function in this previous comment on that same page which is, in fact, meant to check if a string is UTF-8 and was based on this regular expression created by someone at W3C.
Here is the original, correctly working (I've tested) function that will tell you if a string is UTF-8:
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
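Usage sketch (assuming the file is saved as UTF-8; note that preg_match() really returns 1 or 0, which is truthy/falsy):
var_dump((bool) is_utf8('Z⬦Z'));     // true: a valid UTF-8 sequence
var_dump((bool) is_utf8("\xDF"));    // false: a lone Latin-1 byte, invalid UTF-8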
This may not be the answer to your question (maybe it is, see the update below), but it could be the answer to your problem. Check out my Encoding class that has methods to convert strings to UTF8, no matter if they are encoded in Latin1, Win1252, or UTF8 already, or a mix of them:
Encoding::toUTF8($text_or_array);
Encoding::toWin1252($text_or_array);
Encoding::toISO8859($text_or_array);
// fixes UTF8 strings converted to UTF8 repeatedly:
// "FÃÂédÃÂération" to "Fédération"
Encoding::fixUTF8($text_or_array);
https://stackoverflow.com/a/3479832/290221
The function runs byte by byte and figures out whether each one of them needs conversion or not.
Update:
Thinking a little bit more about it, this could in fact be the answer to your question:
require_once('Encoding.php');
function validUTF8($string){
    return Encoding::toUTF8($string) == $string;
}
And here is the Encoding class:
https://github.com/neitanod/forceutf8

URL encoding PHP

I tested urlencode() and rawurlencode() and they produce different results than Firefox and some online encoders...
Example:
Firefox & encoders
ä = %C3%A4
ß = %C3%9F
PHP rawurlencode() and urlencode():
ß = %DF
ä = %E4
What can I do, except hard coding and replacing?
They produce different outputs because you provided different inputs, i.e., different character encodings: Firefox uses UTF-8 and your PHP script uses Windows-1252. Although in both character sets the characters are at the same position (ß=0xDF, ä=0xE4), i.e., they have the same code point, they encode that code point differently:
CP    | UTF-8  | Windows-1252
------+--------+--------------
0xDF  | 0xC39F | 0xDF
0xE4  | 0xC3A4 | 0xE4
Use the same character encoding (preferably UTF-8) and you’ll get the same result.
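A sketch that makes the difference visible (assuming this file is saved as UTF-8):
$char = 'ä';
echo rawurlencode($char), PHP_EOL;   // %C3%A4 (two UTF-8 bytes)
// Re-encoded to Windows-1252 first, the same character yields %E4:
echo rawurlencode(mb_convert_encoding($char, 'Windows-1252', 'UTF-8')), PHP_EOL;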
Maybe Base64-encode the value and send it via POST, so the URLs don't scare off visitors.

mb_detect_encoding() discrepancy for non latin1 characters

I'm using the mb_detect_encoding() function to check if a string contains non latin1 (ISO-8859-1) characters.
Since Japanese isn't part of Latin-1 I'm using it as the text within the test string, yet when the string is passed to the function it seems to return OK for ISO-8859-1. Example code:
$str = "これは日本語のテキストです。読めますか";
$res = mb_detect_encoding($str,"ISO-8859-1",true);
print $res;
I've tried using 'ASCII' instead of 'ISO-8859-1', which correctly returns false. Is anyone able to explain the discrepancy?
I wanted to be funny and say hexdump could explain it:
0000000 81e3 e393 8c82 81e3 e6af a597 9ce6 e8ac
0000010 9eaa 81e3 e3ae 8683 82e3 e3ad b982 83e3
0000020 e388 a781 81e3 e399 8280 aae8 e3ad 8182
0000030 81e3 e3be 9981 81e3 0a8b
But alas, that's quite the opposite.
In ISO-8859-1 practically only the code points \x80-\x9F are invalid. But these are exactly the byte values your UTF-8 representation of the Japanese characters occupy.
Anyway, mb_detect_encoding uses heuristics, and it fails in this example. My conjecture is that it mistakes ISO-8859-1 for -15, or worse: CP1252, the incompatible Windows charset, which does allow said code points.
I would say: use a workaround and test it yourself. The only check to assure that a byte in a string is certainly not a Latin-1 character is:
preg_match('/[\x7F-\x9F]/', $str);
I'm linking to the German Wikipedia, because their article shows the differences best: http://de.wikipedia.org/wiki/ISO_8859-1
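Usage sketch, using the Japanese test string from the question (its UTF-8 continuation bytes fall into that \x7F-\x9F range, so the check flags it):
$str = "これは日本語のテキストです。読めますか";
if (preg_match('/[\x7F-\x9F]/', $str)) {
    echo 'contains bytes that cannot be printable Latin-1 text', PHP_EOL;
}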

Ensuring valid UTF-8 in PHP

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?
For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?
function make_safe_for_utf8_use($string) {
    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}
UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.
Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
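The 0x80-0x9F range is exactly where the two encodings disagree; a two-line sketch:
// 0x80 is € in Windows-1252, but an invisible control character in ISO-8859-1:
echo mb_convert_encoding("\x80", 'UTF-8', 'Windows-1252'), PHP_EOL;  // €
echo bin2hex(mb_convert_encoding("\x80", 'UTF-8', 'ISO-8859-1'));    // c280 (U+0080)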
would this code ensure that a string is safe to insert into a UTF-8 encoded document
You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.
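For reference, a sketch of the question's call with the strict parameter (the third argument) enabled:
$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252", true);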
If you want to be sure, do it yourself using the W3-recommended regex:
if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string)) {
    return $string;
} else {
    return iconv('CP1252', 'UTF-8', $string);
}
With the mbstring library, you have mb_check_encoding().
Example of use:
mb_check_encoding($string, 'UTF-8');
However, with PHP 7.1.9 on a recent Windows 10 system, the regex solution now outperforms mb_check_encoding() for any string length (tested on 20,000 iterations):
10 characters: regex => 4 ms, mb_check_encoding() => 64 ms
10,000 characters: regex => 125 ms, mb_check_encoding() => 2.4 s
Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:
<?php
if (preg_match("//u", $string)) {
    // $string is valid UTF-8
}
Answer to "iconv is idempotent": no, iconv is not idempotent either.
A big difference between utf8_encode() and iconv() is that iconv may raise errors like this "Detected an incomplete multibyte character in input string", even with:
iconv('ISO-8859-1', 'UTF-8//IGNORE', $str)
In the above code:
$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
you have to know mb_detect_encoding's limits: it can answer UTF-8 even for invalid UTF-8 strings (badly formed UTF-8).
Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about character sets. This page links to a page specifically for UTF-8.
