I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8, 'CP1252', 'UTF-8');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered by var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes; here the character encoding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
    // Hex-encode every byte, split into pairs, and join with the '\x' prefix.
    return '\x' . rtrim(chunk_split(strtoupper(bin2hex($str)), 2, '\x'), '\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF8 characters alone?
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represents, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered by var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character, which is U+20AC in Unicode and is expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 code page. Whenever a character conversion fails, the Unicode replacement character U+FFFD is used instead.
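If you want to see or control that substitution explicitly, mbstring exposes it; a minimal sketch (the default substitute and the CP1252 mapping of € have changed across PHP versions, so the commented outputs are illustrative):

// Choose what mbstring substitutes for characters it cannot map.
// Accepts a code point, or the strings "none", "long", "entity".
mb_substitute_character(0x3F); // use '?' instead of the default
echo mb_convert_encoding('15 €', 'CP1252', 'UTF-8'); // e.g. "15 ?" if € is unmappable
mb_substitute_character('long'); // or emit the code value instead, e.g. "U+20AC"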
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
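For the guessing route, a minimal sketch (the file name and candidate list are assumptions you must tailor; single-byte encodings such as Windows-1252 match almost any byte sequence, so the order of the list matters a lot):

// Hypothetical import with an unknown encoding.
$raw = file_get_contents('import.dat');

// Strict detection against an ordered candidate list.
$guess = mb_detect_encoding($raw, ['UTF-8', 'Windows-1252', 'ISO-8859-15'], true);

if ($guess !== false) {
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $guess);
}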
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" while leaving correct UTF8 characters alone?
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
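To sketch the manual-mapping idea (this covers only a few CP1252 specials; a real table must map every byte above \x7F, and $win1252String is a hypothetical input variable):

// Map a handful of CP1252-only code points (0x80-0x9F) to their UTF-8 forms.
$cp1252ToUtf8 = [
    "\x80" => "\u{20AC}", // Euro sign
    "\x93" => "\u{201C}", // left double quotation mark
    "\x94" => "\u{201D}", // right double quotation mark
];
$utf8 = strtr($win1252String, $cp1252ToUtf8); // $win1252String: your raw input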
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
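And if you do want the behavior from the original question — keep valid UTF-8 sequences intact and escape only the offending bytes — here is a sketch built on the well-formed UTF-8 byte patterns (a sketch, not a battle-tested implementation):

function escapeNonUtf8(string $s): string {
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        // Try to match exactly one well-formed UTF-8 sequence at offset $i.
        if (preg_match(
            '/\G(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
            . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
            . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
            . '|\xF4[\x80-\x8F][\x80-\xBF]{2})/',
            $s, $m, 0, $i
        )) {
            $out .= $m[0];
            $i += strlen($m[0]);
        } else {
            // Invalid byte: escape it as \xNN.
            $out .= '\x' . strtoupper(bin2hex($s[$i]));
            $i++;
        }
    }
    return $out;
}

echo escapeNonUtf8("The price is 15 \x80"); // The price is 15 \x80
echo escapeNonUtf8("15 €");                 // 15 € (valid UTF-8 is kept)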
I have encountered something bizarre in the mb_strwidth function; it may be a bug but I thought it better to ask here first in case I'm missing something.
Context
A class is being used to represent a generic string and is both iterable and seekable; with both iterations and seeks applying to the character within the string. The string has full multi-byte support, so when a new position is sought, it not only stores the character position, but recalculates the byte position in the string; like so:
$this->posByte = mb_strwidth(
    mb_substr($this->value, 0, $pos, $this->charEncoding),
    $this->charEncoding
);
Perceived Error
However, when a multi-byte character is introduced, this is returning an incorrect value. The test case is this:
$str = string('The simple sentence of the simple man; here are some multi-byte chars: Øðćă.', 'UTF-8');
$str->seek(72);
This seeks to the second multi-byte character 'ð', but the byte calculation given above returns 72, the same as the character position, whereas it should be 73, since the preceding character 'Ø' has the code point U+00D8, which is 216 in decimal and firmly in the two-byte range for UTF-8.
This is confirmed by using the multi-byte-unaware function strlen() (since I have not enabled mb overloading), which simply counts the number of bytes in a string. This:
$bytePos = strlen(mb_substr($this->value, 0, $pos, $this->charEncoding));
returns 73 as expected.
Is this a known problem?
I can use strlen() for now as a workaround, but I don't particularly like doing so since enabling multi-byte overloading in the PHP config would then cause the errors to reappear; does anyone have any experience of a similar issue? Is PHP just using an out-of-date character mapping?
For the record, this is from a PHPUnit test run on a PHP 5.6.3 windows environment.
You appear to be misinterpreting the function of mb_strwidth. Its purpose has nothing to do with bytes; it merely gives you the visual width of a string according to a fixed table. This is purely interesting for Asian character sets with appropriate monospaced fonts, where Latin characters, commas, and other punctuation are half-width and "regular" characters are full-width. Everything up to and including U+1FFF is 1.
You need to use strlen and other encoding-unaware functions to operate on strings in bytes, and mb_ functions to operate on them on a character level, to figure out your byte/character relationships.
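For illustration, here is how those measurements diverge on the 'Ø' from the question:

$s = 'xØ'; // 'Ø' is U+00D8, two bytes in UTF-8
var_dump(strlen($s));               // int(3) -> bytes
var_dump(mb_strlen($s, 'UTF-8'));   // int(2) -> characters
var_dump(mb_strwidth($s, 'UTF-8')); // int(2) -> display columns, NOT bytes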
If you're worried about the barbaric mb-overloading, either check the ini setting and refuse operation on insane systems, or use mb_strlen with a single-byte encoding set.
I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user actually submits the form), or it could be from an uploaded text file, so I really have no control over the input.
What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text);
but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/
For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).
I've read the other Stack Overflow questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").
But there must be something that at least has a good try!
What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.
However, you could try doing this:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
Setting it to strict might help you get a better result.
In motherland Russia we have four popular encodings, so your question is in great demand here.
You cannot detect the encoding from character codes alone, because code pages intersect; some code pages in different languages even overlap completely. So we need another approach.
The only way to work with unknown encodings is probabilistic. So we do not want to answer the question "what is the encoding of this text?"; instead, we try to answer "what is the most likely encoding of this text?".
One guy on a popular Russian tech blog invented this approach:
Build the probability range of character codes in every encoding you want to support. You can build it using some big texts in your language (e.g., some fiction, use Shakespeare for English and Tolstoy for Russian, LOL). You will get something like this:
encoding_1:
190 => 0.095249209893009,
222 => 0.095249209893009,
...
encoding_2:
239 => 0.095249209893009,
207 => 0.095249209893009,
...
encoding_N:
charcode => probabilty
Next, you take text in an unknown encoding, and for every encoding in your "probability dictionary" you look up the frequency of every symbol in the unknown-encoded text. Sum the probabilities of the symbols. The encoding with the highest rating is likely the winner. Results are better for bigger texts.
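A toy sketch of that scoring loop, assuming you have already built the per-encoding byte-frequency tables (the table contents below are hypothetical):

// $tables: ['CP1251' => [224 => 0.062, ...], 'KOI8-R' => [...], ...]
function guessEncoding(string $text, array $tables): string {
    $scores = [];
    foreach ($tables as $encoding => $freq) {
        $score = 0.0;
        // count_chars(..., 1) gives byte value => occurrence count.
        foreach (count_chars($text, 1) as $byte => $count) {
            $score += ($freq[$byte] ?? 0.0) * $count;
        }
        $scores[$encoding] = $score;
    }
    arsort($scores); // highest score first
    return array_key_first($scores);
}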
Btw, mb_detect_encoding certainly does not work — at all. Please take a look at the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".
Just use the mb_convert_encoding function. It will attempt to autodetect the character set of the provided text, or you can pass it a list.
Also, I tried to run:
$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);
and the results are the same for both.
There is no way to identify the character set of a string that is completely accurate.
There are ways to try to guess the character set. One of these ways, and probably/currently the best in PHP, is mb_detect_encoding. This will scan your string and look for occurrences of stuff unique to certain character sets. Depending on your string, there may not be such distinguishable occurrences.
Take the ISO-8859-1 character set vs ISO-8859-15.
There's only a handful of different characters, and to make it worse, they're represented by the same bytes. There is no way to detect, being given a string without knowing its encoding, whether byte 0xA4 is supposed to signify ¤ or € in your string, so there is no way to know its exact character set.
(Note: you could add a human factor, or an even more advanced scanning technique (e.g., what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far.)
There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you can and should never rely on it being correct.
Interesting read: How do I determine the charset/encoding of a string?
There are other ways of ensuring the correct character set though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: Rails and Snowmen)
That being done, you can at least be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the Unix file -i command on them via, e.g., exec() (if possible on your server) to aid the detection (using the document's BOM).
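If you go the file route, a hedged sketch of shelling out (GNU file; flag support and output format vary by platform, and the path here is hypothetical):

$path = '/tmp/upload.txt'; // hypothetical uploaded file
$enc = trim((string) shell_exec('file -b --mime-encoding ' . escapeshellarg($path)));
// Typical outputs: "utf-8", "iso-8859-1", "us-ascii", or "binary"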
Concerning scraping data, you could read the HTTP headers, that usually specify the character set. When parsing XML files, see if the XML meta-data contain a charset definition.
Rather than trying to automagically guess the character set, you should first try to ensure a certain character set yourself where possible, or trying to grab a definition from the source you're getting it from (if applicable) before resorting to detection.
There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.
My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:
1) Attempt to detect the encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
2) If the encoding cannot be detected, throw new RuntimeException
3) If the input is UTF-8, carry on.
4) Else, if it is ISO-8859-1 or ASCII:
a. Attempt conversion to UTF-8 (wait, not finished)
b. Detect the encoding of the converted value
c. If the reported encoding and converted value are both UTF-8, carry on.
d. Else, throw new RuntimeException
From my abstract class Sanitizer
private function isUTF8($encoding, $value)
{
    return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
}

private function utf8tify(&$value)
{
    $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];

    mb_internal_encoding('UTF-8');
    mb_substitute_character(0xfffd); // REPLACEMENT CHARACTER
    mb_detect_order($encodings);

    $stringEncoding = mb_detect_encoding($value, $encodings, true);

    if (!$stringEncoding) {
        $value = null;
        throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
    }

    if ($this->isUTF8($stringEncoding, $value)) {
        return;
    } else {
        $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        } else {
            $value = null;
            throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
        }
    }

    return;
}
One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more. I found this article.
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 are dubious.
People have pointed out shortcomings in the PHP mb_* functions. I never took time to investigate iconv, but if it works better than the mb_* functions, let me know.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user actually submits the form), or it could be from an uploaded text file, so I really have no control over the input.
I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
If it's a file you won't save it UTF-8 encoded into the database, but in binary form. When you output the file again, use binary output as well, then this is totally transparent.
Your idea is nice, that a user can tell the encoding, but he/she can tell anyway after downloading the file, as it's binary.
So I must admit I don't see a specific issue you raise with your question.
It seems that your question has been answered quite well, but I have an approach that may simplify your case:
I had a similar issue trying to return string data from MySQL, even after configuring both the database and PHP to return strings formatted as UTF-8. The only place the error showed up was when actually returning them from the database.
Finally, sailing through the web I found a really easy way to deal with it:
Given that you can save all those types of string data in your MySQL in different formats and collations, you only need to set the collation to UTF-8, right in your PHP connection file, like this:
$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");
This means that you first save the data in any format or collation, and you convert it only on the way back to your PHP file.
If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)
enca also came up here: How to find encoding of a file in Unix via script(s)
There are a couple of libraries out there. onnov/detect-encoding looks promising. It claims to do better than mb_detect_encoding.
Example usage for converting string in unknown character encoding to UTF-8:
use Onnov\DetectEncoding\EncodingDetector;

$detector = new EncodingDetector(); // instantiate the detector
$detector->iconvXtoEncoding('Проверяемый текст');
To simply detect encoding:
$encoding = $detector->getEncoding('Проверяемый текст');
You could set up a set of metrics to try to guess which encoding is being used. Again, it is not perfect, but it could catch some of the misses from mb_detect_encoding().
Because the usage of UTF-8 is widespread, you can assume it is the default, and when it is not, try to guess and convert the encoding. Here is the code:
function make_utf8(string $string)
{
    // Test it and see if it is UTF-8 or not
    $utf8 = \mb_detect_encoding($string, ["UTF-8"], true);
    if ($utf8 !== false) {
        return $string;
    }

    // From now on, it is a safe assumption that $string is NOT UTF-8-encoded
    // The detection strictness (i.e. third parameter) is up to you
    // You may set it to false to return the closest matching encoding
    $encoding = \mb_detect_encoding($string, mb_detect_order(), true);
    if ($encoding === false) {
        throw new \RuntimeException("String encoding cannot be detected");
    }

    return \mb_convert_encoding($string, "UTF-8", $encoding);
}
Simple, safe and fast.
If the text is retrieved from a MySQL database, you may try adding this after the database connection.
mysqli_set_charset($con, "utf8");
mysqli::set_charset
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace, and so on work properly by default?
None of PHP's standard string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese, or Korean character takes more than a single byte. E.g., a typical character, say 'x', takes 1 byte in English, but it takes more than 1 byte in Japanese, Chinese, or Korean. Now, PHP's standard string functions are meant to treat a single character as 1 byte. So if you try to compare two Japanese, Chinese, or Korean characters, they will not work as expected. For example, the length of "Hello World!" in Japanese, Chinese, or Korean will be more than 12 bytes.
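A short demonstration of that difference (assuming this source file is saved as UTF-8):

$jp = 'こんにちは';                      // 5 characters, 3 bytes each in UTF-8
var_dump(strlen($jp));                   // int(15): bytes, not characters
var_dump(mb_strlen($jp, 'UTF-8'));       // int(5)
var_dump(substr($jp, 0, 1));             // string(1): a broken lead byte
var_dump(mb_substr($jp, 0, 1, 'UTF-8')); // string(3) "こ"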
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
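To illustrate the point positively, here is a minimal byte-oriented replacement that stays correct precisely because of that design property:

// str_replace works byte-wise, yet cannot corrupt valid UTF-8:
// the byte sequence of '€' (E2 82 AC) can never occur inside
// the encoding of any other character.
$text = 'Prix : 15 € TTC';
echo str_replace('€', 'EUR', $text); // Prix : 15 EUR TTC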
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
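As a concrete PHP illustration of "providing the additional information": with preg_*, the /u modifier tells PCRE to interpret both pattern and subject as UTF-8. A tiny demonstration:

$s = 'ééé'; // three characters, six bytes in UTF-8
var_dump(preg_match('/^.{3}$/',  $s)); // int(0): '.' matches bytes here
var_dump(preg_match('/^.{3}$/u', $s)); // int(1): '.' matches characters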
Summary:
No sane 'default' is possible as PHP strings do not contain the character encoding. And even if, a single function like strlen() cannot return the length of the byte sequence as required for Content-Length HTTP header and at the same time the number of characters as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is used to work with strings in languages other than English.
2) PHP's default string functions only work properly with English (and related) languages.
3) If you want to use strlen(), strpos(), strtoupper(), or str_replace() on special characters, they can fail.
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते), or Gujarati (હેલો).
Different languages can have their own character sets, which is why mbstring was introduced: to work with various languages (Chinese, Japanese, etc.).
The name Raul González is a perfect example of why:
It is about shortening over-long user names for a MySQL database. Say we have a 10-character limit and the name Raul González.
The unit test below is an example of how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    try {
        $name1 = substr($name, 0, 10); // byte-based cut splits the two-byte 'á'
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));

    $name2 = mb_substr($name, 0, 10); // character-based cut is safe
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
PHP Laravel and PhpUnit was used for illustration.
If that's relevant (it very well could be), they are PHP source code files.
There are a few pitfalls to take care of:
PHP is not aware of the BOM character certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates the file is UTF-8, but it is not necessary, and it is invisible. This can cause "headers already sent" warnings from functions that deal with HTTP headers, because PHP will output the BOM to the browser if it sees one, and that will prevent you from sending any header. Make sure your text editor has a UTF-8 (No BOM) encoding; if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
Default string functions are not multibyte-encoding-aware. This means that strlen really returns the number of bytes in the string, not the actual number of characters. This isn't too much of a problem until you start splicing strings of non-ASCII characters with functions like substr: when you do, the indices you pass refer to byte indices rather than character indices, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will return an invalid UTF-8 character because in UTF-8, é actually takes two bytes and substr will return only the first one. (The solution is to use the mb_ string functions, which are aware of multibyte encodings; a short demonstration follows this list.)
You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP makes no automagic conversion. To that end, you may use implementation-specific means (for instance, MySQL has a special query that lets you specify in which encoding you expect the result: SET CHARACTER SET UTF8 or something along these lines), or if you couldn't find a better way, mb_convert_encoding or iconv will convert one string into another encoding.
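To make the substr pitfall above concrete, here is a minimal demonstration (assuming the source file itself is saved as UTF-8):

mb_internal_encoding('UTF-8');
echo bin2hex(substr('é', 0, 1));          // "c3": half of the character, invalid UTF-8
echo mb_substr('é', 0, 1);                // "é": one whole character
echo strlen('é'), ' vs ', mb_strlen('é'); // "2 vs 1"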
It's actually usually recommended that you keep all sources in UTF-8. It won't change the size of regular code with Latin characters at all, but it will prevent glitches with any special characters.
If you are using any special characters in, e.g., string values, the size is a little bit bigger, but that shouldn't matter.
Nevertheless, my suggestion is to always keep the default format. I lost many hours because files were saved in the wrong format and all the special characters changed.
From a technical point of view, there isn't a difference!
Very relevant: the PHP parser may start to output spurious characters, like a funky upside-down question mark. Just stick to the norm; it's much preferred.