I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitting the form), or it could be from an uploaded text file, so I really have no control over the input.
What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text);
but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/
For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).
I've read the other Stack Overflow questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").
But there must be something that at least has a good try!
What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.
However, you could try doing this:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
Setting it to strict might help you get a better result.
In motherland Russia we have four popular encodings, so your question is in great demand here.
You cannot detect the encoding from the character codes of the symbols alone, because code pages overlap. Some code pages for different languages even overlap completely. So, we need another approach.
The only way to work with unknown encodings is to work with probabilities. So, we do not want to answer the question "what is the encoding of this text?"; we are trying to understand "what is the most likely encoding of this text?".
One guy here in a popular Russian tech blog invented this approach:
Build the probability range of character codes in every encoding you want to support. You can build it using some big texts in your language (e.g., some fiction, use Shakespeare for English and Tolstoy for Russian, LOL). You will get something like this:
encoding_1:
    190 => 0.095249209893009,
    222 => 0.095249209893009,
    ...
encoding_2:
    239 => 0.095249209893009,
    207 => 0.095249209893009,
    ...
encoding_N:
    charcode => probability
Next, you take text in an unknown encoding and, for every encoding in your "probability dictionary", you look up the probability of every symbol in the unknown-encoded text. Sum the probabilities of the symbols. The encoding with the highest score is the likely winner. Results improve with longer texts.
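The scoring scheme described above can be sketched in a few lines of PHP. This is only an illustration under assumptions: the function names buildByteProfile() and guessEncoding() are made up for this example, the training sample is a single sentence rather than a large corpus, and only bytes at or above 0x80 are counted, since the ASCII range is identical in these code pages. It assumes the mbstring extension and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper: build a byte => probability table for one encoding
// from a training sample that is given as UTF-8.
function buildByteProfile(string $sampleUtf8, string $encoding): array
{
    $bytes = mb_convert_encoding($sampleUtf8, $encoding, 'UTF-8');
    $total = strlen($bytes);
    $profile = [];
    // count_chars(..., 1) returns only the byte values that occur, with their counts.
    foreach (count_chars($bytes, 1) as $byte => $count) {
        if ($byte >= 0x80) { // ASCII bytes are identical across these code pages
            $profile[$byte] = $count / $total;
        }
    }
    return $profile;
}

// Hypothetical helper: sum the per-byte probabilities of $unknown under each
// profile and return the name of the best-scoring encoding (or null if
// nothing scored above zero).
function guessEncoding(string $unknown, array $profiles): ?string
{
    $best = null;
    $bestScore = 0.0;
    $counts = count_chars($unknown, 1);
    foreach ($profiles as $encoding => $profile) {
        $score = 0.0;
        foreach ($counts as $byte => $count) {
            $score += $count * ($profile[$byte] ?? 0.0);
        }
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

// Tiny training sample (a real profile needs far more text).
$sample = 'Все счастливые семьи похожи друг на друга, каждая несчастливая семья несчастлива по-своему.';
$profiles = [
    'Windows-1251' => buildByteProfile($sample, 'Windows-1251'),
    'KOI8-R'       => buildByteProfile($sample, 'KOI8-R'),
];

// Pretend we received these bytes without knowing their encoding.
$unknown = mb_convert_encoding('Проверяемый текст', 'KOI8-R', 'UTF-8');
echo guessEncoding($unknown, $profiles), "\n";
```

With such a short sample the scores are crude; the point is only to show the shape of the "probability dictionary" and the summing step.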
Btw, mb_detect_encoding certainly does not work. Yes, at all. Please take a look at the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".
Just use the mb_convert_encoding function. If you omit the source encoding it assumes PHP's internal encoding; to have it detect the character set of the text provided, pass it a list of candidate encodings as the third argument.
Also, I tried to run:
$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);
and the results are the same for both.
There is no way to identify the character set of a string that is completely accurate.
There are ways to try to guess the character set. One of these ways, and probably/currently the best in PHP, is mb_detect_encoding. This will scan your string and look for occurrences of stuff unique to certain character sets. Depending on your string, there may not be such distinguishable occurrences.
Take the ISO-8859-1 character set vs ISO-8859-15.
There's only a handful of different characters, and to make it worse, they're represented by the same bytes. There is no way to detect, being given a string without knowing its encoding, whether byte 0xA4 is supposed to signify ¤ or € in your string, so there is no way to know its exact character set.
(Note: you could add a human factor, or an even more advanced scanning technique (e.g., what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far.)
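To make the 0xA4 ambiguity concrete, here is a small sketch (assuming the mbstring extension is available). The same single byte converts to two different characters depending on which source encoding you claim it has, and both conversions succeed:

```php
<?php
$byte = "\xA4"; // one byte, with no encoding information attached

// Interpreted as ISO-8859-1 the byte is the currency sign U+00A4 ...
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
// ... interpreted as ISO-8859-15 it is the euro sign U+20AC.
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

var_dump($asLatin1 === "\u{00A4}");
var_dump($asLatin9 === "\u{20AC}");
```

Nothing in the byte itself tells you which of the two results is the intended one.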
There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you can and should never rely on it being correct.
Interesting read: How do I determine the charset/encoding of a string?
There are other ways of ensuring the correct character set though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: Rails and Snowmen)
That being done, at least you can be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the Unix 'file -i' command on them through, e.g., exec() (if possible on your server) to aid the detection (using the document's BOM).
Concerning scraping data, you could read the HTTP headers, that usually specify the character set. When parsing XML files, see if the XML meta-data contain a charset definition.
Rather than trying to automagically guess the character set, you should first try to ensure a certain character set yourself where possible, or try to grab a definition from the source you're getting it from (if applicable), before resorting to detection.
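As a rough sketch of "grab a definition from the source", the two helpers below pull a declared charset out of a Content-Type header value and out of an XML declaration. The function names are made up for this example and the regular expressions are deliberately simple (for instance, they do not handle a BOM in front of the XML declaration), so treat them as a starting point rather than a complete parser:

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value such as "text/html; charset=ISO-8859-1".
function charsetFromContentType(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: extract the encoding pseudo-attribute from an XML
// declaration at the very start of a document.
function charsetFromXmlDeclaration(string $xml): ?string
{
    if (preg_match('/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

var_dump(charsetFromContentType('text/html; charset=ISO-8859-1'));
var_dump(charsetFromXmlDeclaration('<?xml version="1.0" encoding="windows-1252"?><a/>'));
```

A declared charset can still be wrong, so it is worth validating the converted result before storing it.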
There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.
My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:
1. Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
2. If encoding cannot be detected, throw new RuntimeException
3. If input is UTF-8, carry on.
4. Else, if it is ISO-8859-1 or ASCII
   a. Attempt conversion to UTF-8 (wait, not finished)
   b. Detect the encoding of the converted value
   c. If the reported encoding and converted value are both UTF-8, carry on.
   d. Else, throw new RuntimeException
From my abstract class Sanitizer
private function isUTF8($encoding, $value)
{
    return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
}

private function utf8tify(&$value)
{
    $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
    mb_internal_encoding('UTF-8');
    mb_substitute_character(0xfffd); //REPLACEMENT CHARACTER
    mb_detect_order($encodings);
    $stringEncoding = mb_detect_encoding($value, $encodings, true);

    if (!$stringEncoding) {
        $value = null;
        throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
    }

    if ($this->isUTF8($stringEncoding, $value)) {
        return;
    } else {
        $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        } else {
            $value = null;
            throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
        }
    }

    return;
}
One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more. I found this article.
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 is dubious.
People have pointed out shortcomings in the PHP mb_* functions. I never took time to investigate iconv, but if it works better than the mb_* functions, let me know.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitting the form), or it could be from an uploaded text file, so I really have no control over the input.
I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
If it's a file you won't save it UTF-8 encoded into the database, but in binary form. When you output the file again, use binary output as well, then this is totally transparent.
Your idea that a user can tell the encoding is nice, but he/she can tell anyway after downloading the file, as it's binary.
So I must admit I don't see a specific issue you raise with your question.
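The validation step mentioned above for form input can be as small as the following sketch. The function name requireUtf8() is made up for this example; it assumes the mbstring extension and simply rejects anything that is not well-formed UTF-8 instead of trying to guess and repair it:

```php
<?php
// Hypothetical helper: accept the input only if it is well-formed UTF-8.
function requireUtf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $input;
}

// "\u{E9}" is é encoded as UTF-8 (two bytes), so this passes.
$ok = requireUtf8("fianc\u{E9}e");

// A lone 0xE9 byte is é in ISO-8859-1 but malformed in UTF-8, so this throws.
try {
    requireUtf8("fianc\xE9e");
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Rejecting is the conservative choice; whether to reject or to attempt a conversion is up to the application.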
It seems that your question has been well answered, but I have an approach that may simplify your case:
I had a similar issue trying to return string data from MySQL, even after configuring both the database and PHP to use UTF-8. The error only showed up when the strings were actually returned from the database.
Finally, surfing the web, I found a really easy way to deal with it:
Given that you can save all those types of string data in MySQL in different formats and collations, you only need to set the connection character set to UTF-8, right in your PHP connection file, like this:
$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");
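This means that you can store the data in any format or collation, and it is converted only when it is returned to your PHP code. (Note that MySQL's "utf8" only covers characters of up to three bytes; "utf8mb4" covers all of UTF-8.)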
If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)
enca also came up here: How to find encoding of a file in Unix via script(s)
There are a couple of libraries out there. onnov/detect-encoding looks promising. It claims to do better than mb_detect_encoding.
Example usage for converting string in unknown character encoding to UTF-8:
use Onnov\DetectEncoding\EncodingDetector;

$detector = new EncodingDetector();
$detector->iconvXtoEncoding('Проверяемый текст');
To simply detect encoding:
$encoding = $detector->getEncoding('Проверяемый текст');
You could set up a set of metrics to try to guess which encoding is being used. Again, it is not perfect, but it could catch some of the misses from mb_detect_encoding().
Because UTF-8 is so widespread, you can assume it is the default and, when the input is not UTF-8, try to guess and convert the encoding. Here is the code:
function make_utf8(string $string)
{
    // Test it and see if it is UTF-8 or not
    $utf8 = \mb_detect_encoding($string, ["UTF-8"], true);

    if ($utf8 !== false) {
        return $string;
    }

    // From now on, it is a safe assumption that $string is NOT UTF-8-encoded

    // The detection strictness (i.e. third parameter) is up to you
    // You may set it to false to return the closest matching encoding
    $encoding = \mb_detect_encoding($string, mb_detect_order(), true);

    if ($encoding === false) {
        throw new \RuntimeException("String encoding cannot be detected");
    }

    return \mb_convert_encoding($string, "UTF-8", $encoding);
}
Simple, safe and fast.
If the text is retrieved from a MySQL database, you may try adding this after the database connection.
mysqli_set_charset($con, "utf8");
mysqli::set_charset
Related
I need to automatically determine the character encoding of strings from email content and headers. For the most part this isn't an issue; however, there is an occasional email with content and/or a header that has an oddball character such as an en dash. I received an answer that technically seems to work if I statically test it on a specific header for a specific email; however, that blatantly ignores the fact that importing email needs to be a completely automated process, and there I am utterly unable to automatically determine the string's character encoding.
I've started with the basics, such as detecting common trouble characters that seem to guarantee a character encoding issue will occur. However, strpos('en dash: –', '–') works fine when I test it manually, but it fails outright when added directly to the automated process. My guess is that the string literals are UTF-8 encoded, while the automated process is testing a string that isn't yet UTF-8, so the same character isn't represented by the same bytes.
My second attempt relied on the fact that mb_detect_encoding's second parameter can be an array. So I tried the following:
$encodings = array('UTF-8','UCS-4','UCS-4BE','UCS-4LE','UCS-2','UCS-2BE','UCS-2LE','UTF-32','UTF-32BE','UTF-32LE','UTF-16','UTF-16BE','UTF-16LE','UTF-7','UTF7-IMAP','ASCII','EUC-JP','SJIS','eucJP-win','SJIS-win','ISO-2022-JP','ISO-2022-JP-MS','CP932','CP51932','SJIS-mac','SJIS-Mobile#DOCOMO','SJIS-Mobile#KDDI','SJIS-Mobile#SOFTBANK','UTF-8-Mobile#DOCOMO','UTF-8-Mobile#KDDI-A','UTF-8-Mobile#KDDI-B','UTF-8-Mobile#SOFTBANK','ISO-2022-JP-MOBILE#KDDI','JIS','JIS-ms','CP50220','CP50220raw','CP50221','CP50222','ISO-8859-1','ISO-8859-2','ISO-8859-3','ISO-8859-4','ISO-8859-5','ISO-8859-6','ISO-8859-7','ISO-8859-8','ISO-8859-9','ISO-8859-10','ISO-8859-13','ISO-8859-14','ISO-8859-15','ISO-8859-16','byte2be','byte2le','byte4be','byte4le','BASE64','HTML-ENTITIES','7bit','8bit','EUC-CN','CP936','GB18030','HZ','EUC-TW','CP950','BIG-5','EUC-KR','UHC','ISO-2022-KR','Windows-1251','Windows-1252','CP866','KOI8-R','KOI8-U','ArmSCII-8');
$encoding = mb_detect_encoding($s, $encodings, true);
$compare = mb_convert_encoding($s, 'UTF-8', $encoding);
foreach ($encodings as $k1)
{
if (mb_convert_encoding($s, 'UTF-8', $k1) === $s) {$encoding = $k1; break;}
}
Unfortunately that seemed to result in the same failure based on what I presume was the same underlying issue.
For my third idea I'm looking for validation from someone more experienced. I could convert the string to its binary form (ones and zeroes, not binary data). Then I could convert the string to another encoding, convert that second string to binary as well, and compare the two binary versions; if they match (===), then I might have determined the correct character encoding?
Now I can easily try this with this answer from an unrelated thread however I'm not certain if this is a valid idea or not. This is all intended to answer my question:
How can I determine the actual character encoding of a string in order to convert it to UTF-8 with fully automated validation without corrupting data?
By validation I'm talking about stuff like comparing the binary data though again, I'm not certain if that is a valid approach or not. I do know that I absolutely hate en dashes though.
The answer won't change: it's impossible. You have to rely on external information about which encoding the text uses.
Guessing an encoding can go horribly wrong:
Depending on the order in which you test, the result can be e.g. ASCII or UTF-8 or Windows-1252, just because the text happens to fit so far. Your list is questionable, because it may match Base64, which is not even a text encoding.
If the source is not properly encoded itself, then guessing its encoding will most likely exclude the correct one and guess a wrong one, which makes things worse.
Many encodings share the same byte range: the source can fit e.g. either Windows-1252 or Windows-1251, and even detecting the lexical sense of the text cannot guarantee which of the two is correct.
Also: ones and zeroes are binary. PHP strings are only byte arrays, so they're binary to begin with. How they're interpreted is up to you: if your code is $text= "グリーン"; then it depends on which encoding your PHP source file has and how your PHP defaults are set. There is no "internal ... character", only bytes. This is also the reason why there are functions which operate on bytes (e.g. strlen()) and functions which operate on a specific text encoding (e.g. mb_strlen()).
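A short sketch of the bytes-versus-characters distinction, assuming the mbstring extension. The string is written with PHP 7+ code point escapes so that it is UTF-8 regardless of how the source file is saved:

```php
<?php
// "グリーン" spelled out as code points; PHP emits these as UTF-8 bytes.
$text = "\u{30B0}\u{30EA}\u{30FC}\u{30F3}";

$byteCount = strlen($text);             // counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // counts characters, given the encoding

// Claiming a different encoding for the very same bytes changes the answer.
$asLatin1 = mb_strlen($text, 'ISO-8859-1');

var_dump($byteCount, $charCount, $asLatin1);
```

Each of the four characters takes three bytes in UTF-8, so strlen() sees 12 bytes, mb_strlen() with 'UTF-8' sees 4 characters, and mb_strlen() with 'ISO-8859-1' treats every byte as its own character.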
Whether or not you hate particular characters, they can simply be used as what they are: characters in text. And – has its own valid meaning in contrast to — and ‒ and -; don't replace it based on personal opinion, because that could corrupt the meaning of the context. It's like ignoring the fact that A and Α and A are all different characters. You might want to look up the difference between homoglyphs and synoglyphs - the latter is your current perspective.
You may ask "And in which encoding does PHP interpret the scripts?" Luckily ASCII is the common denominator of most encodings, so interpreting the first bytes of a file as ASCII to search for <?php (all these are ASCII characters, so for PHP code itself it doesn't matter if it is effectively UTF-8 or ISO-8859-1 or Shift-JIS) will only fail when the document is encoded in e.g. UTF-16 - in that case you must set your PHP defaults to that encoding. Which again proves: text encodings must be stated outside of the text.
Since our PHP code runs in different environments that we do not control (and whose encoding we don't know), the idea is to not use any non-ASCII characters in the source code.
However, there are a few places in the code where string literals containing non-ASCII characters are defined, like 'TextWithÜ'.
Is there a way to write the 'Ü' using ASCII only?
The best I can think of is to use HTML-notation and decode it.
html_entity_decode('TextWith&Uuml;');
However, since we do not know the systems default encoding, I would have to detect that as well:
html_entity_decode('TextWith&Uuml;', ENT_COMPAT | ENT_HTML401, ini_get('default_charset'));
And html_entity_decode supports only a subset of the possible values of ini_get('default_charset'), which is why that might fail sometimes.
Is there a better way?
If you're shipping the source code files, you do control their encoding. If you save your files in UTF-8 encoding, all string literals inside those files will be UTF-8 encoded. Someone would have to purposefully convert the encoding of a file to change that; it hardly happens by accident or through some misconfiguration.
If you're still concerned about this, the best way is probably to express the strings directly as bytes:
$str = "TextWith\xC3\x9C"; // "Ü"
This will be somewhat cumbersome to both write and read, but is the most direct way to system-agnostically produce strings with content in a specific encoding.
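As a small sketch of the byte-escape approach: since PHP 7 there is also the "\u{...}" code point escape, which keeps the source file pure ASCII and always produces UTF-8 bytes. The example below (assuming the mbstring extension) shows that both spellings yield the same string, and how to convert explicitly if some caller needs another encoding:

```php
<?php
$bytes   = "TextWith\xC3\x9C"; // the two UTF-8 bytes of U+00DC, written out by hand
$escaped = "TextWith\u{00DC}"; // PHP 7+ code point escape, emitted as UTF-8

var_dump($bytes === $escaped);
var_dump(mb_check_encoding($bytes, 'UTF-8'));

// Convert explicitly from the known encoding when another one is required.
$latin1 = mb_convert_encoding($bytes, 'ISO-8859-1', 'UTF-8');
var_dump($latin1 === "TextWith\xDC");
```

Either way, the literal has a known encoding that does not depend on the system's default_charset.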
Assuming you're running your files as included files in another app, and your concern is that you don't know what encoding that other app expects, you would create an "encoding sandwich". Your code is in the middle and uses one standardised encoding (preferably UTF-8), with the "edges" converting to and from whatever the other surrounding code expects. That means you need defined borders, defined functions which the other code interacts with. On all input points, you do something like:
function take_input($input) {
$input = iconv(App::externalEncoding(), 'UTF-8', $input);
...
}
At all points which return data to other code, you'd do:
function return_output() {
...
return iconv('UTF-8', App::externalEncoding(), $output);
}
From the other app's point of view, that would look something like:
require_once 'JochensCode.php';
App::externalEncoding('SJIS');
take_input('文字化け');
echo return_output();
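To make the "encoding sandwich" above a bit more tangible, here is one way the pieces could fit together. Everything in it is an assumption made for illustration: the App class is a hypothetical stand-in whose externalEncoding() method both sets and returns the outer encoding, the Sandwich class merely holds the UTF-8 data in the middle, and ISO-8859-1 is used instead of SJIS so the example does not depend on which encodings the local iconv supports.

```php
<?php
// Hypothetical stand-in for the surrounding application's configuration.
final class App
{
    private static string $external = 'UTF-8';

    // With an argument it sets the external encoding; it always returns the current one.
    public static function externalEncoding(?string $encoding = null): string
    {
        if ($encoding !== null) {
            self::$external = $encoding;
        }
        return self::$external;
    }
}

// Hypothetical holder for the data in the middle of the sandwich (always UTF-8).
final class Sandwich
{
    public static string $stored = '';
}

// Input edge: convert from the external encoding to UTF-8.
function take_input(string $input): void
{
    $utf8 = iconv(App::externalEncoding(), 'UTF-8', $input);
    if ($utf8 === false) {
        throw new RuntimeException('Input does not match the declared external encoding.');
    }
    Sandwich::$stored = $utf8;
}

// Output edge: convert from UTF-8 back to the external encoding.
function return_output(): string
{
    $out = iconv('UTF-8', App::externalEncoding(), Sandwich::$stored);
    if ($out === false) {
        throw new RuntimeException('Output cannot be represented in the external encoding.');
    }
    return $out;
}

App::externalEncoding('ISO-8859-1');
take_input("fianc\xE9e");  // é as a single ISO-8859-1 byte
$inside  = Sandwich::$stored;
$outside = return_output();
```

Inside the sandwich the value is UTF-8; the caller gets back exactly the bytes it handed in.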
I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.
What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text);
but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/
For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).
I've read the other Stack Overflow questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").
But there must be something that at least has a good try!
What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.
However, you could try doing this:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
Setting it to strict might help you get a better result.
In motherland Russia we have four popular encodings, so your question is in great demand here.
Only by character codes of symbols you can not detect the encoding, because code pages intersect. Some codepages in different languages have even full intersection. So, we need another approach.
The only way to work with unknown encodings is working with probabilities. So, we do not want to answer the question "what is encoding of this text?", we are trying to understand "what is most likely encoding of this text?".
One guy here in a popular Russian tech blog invented this approach:
Build the probability range of character codes in every encoding you want to support. You can build it using some big texts in your language (e.g., some fiction, use Shakespeare for English and Tolstoy for Russian, LOL). You will get something like this:
encoding_1:
190 => 0.095249209893009,
222 => 0.095249209893009,
...
encoding_2:
239 => 0.095249209893009,
207 => 0.095249209893009,
...
encoding_N:
charcode => probabilty
Next, you take text in an unknown encoding and for every encoding in your "probability dictionary" you search for the frequency of every symbol in the unknown-encoded text. Sum the probabilities of symbols. Encoding with the bigger rating is likely the winner. There are better results for bigger texts.
Btw, mb_detect_encoding certainly does not work. Yes, at all. Please, take a look of the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".
Just use the mb_convert_encoding function. It will attempt to autodetect character set of the text provided or you can pass it a list.
Also, I tried to run:
$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);
and the results are the same for both.
There is no way to identify the character set of a string that is completely accurate.
There are ways to try to guess the character set. One of these ways, and probably/currently the best in PHP, is mb_detect_encoding. This will scan your string and look for occurrences of stuff unique to certain character sets. Depending on your string, there may not be such distinguishable occurrences.
Take the ISO-8859-1 character set vs ISO-8859-15.
There's only a handful of different characters, and to make it worse, they're represented by the same bytes. There is no way to detect, being given a string without knowing its encoding, whether byte 0xA4 is supposed to signify ¤ or € in your string, so there is no way to know its exact character set.
(Note: you could add a human factor, or an even more advanced scanning technique (e.g., what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far.)
There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you can and should never rely on it being correct.
Interesting read: How do I determine the charset/encoding of a string?
There are other ways of ensuring the correct character set though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: Rails and Snowmen)
That being done, at least you're can be sure that every text submitted through your forms is utf_8. Concerning uploaded files, try running the Unix 'file -i' command on it through, e.g., exec() (if possible on your server) to aid the detection (using the document's BOM).
Concerning scraping data, you could read the HTTP headers, that usually specify the character set. When parsing XML files, see if the XML meta-data contain a charset definition.
Rather than trying to automagically guess the character set, you should first try to ensure a certain character set yourself where possible, or trying to grab a definition from the source you're getting it from (if applicable) before resorting to detection.
There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.
My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:
Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
If encoding cannot be detected, throw new RuntimeException
If input is UTF-8, carry on.
Else, if it is ISO-8859-1 or ASCII
a. Attempt conversion to UTF-8 (wait, not finished)
b. Detect the encoding of the converted value
c. If the reported encoding and converted value are both UTF-8, carry on.
d. Else, throw new RuntimeException
From my abstract class Sanitizer
private function isUTF8($encoding, $value)
{
return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
}
private function utf8tify(&$value)
{
$encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
mb_internal_encoding('UTF-8');
mb_substitute_character(0xfffd); //REPLACEMENT CHARACTER
mb_detect_order($encodings);
$stringEncoding = mb_detect_encoding($value, $encodings, true);
if (!$stringEncoding) {
$value = null;
throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
}
if ($this->isUTF8($stringEncoding, $value)) {
return;
} else {
$value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
$stringEncoding = mb_detect_encoding($value, $encodings, true);
if ($this->isUTF8($stringEncoding, $value)) {
return;
} else {
$value = null;
throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
}
}
return;
}
One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more. I found this article.
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 are dubious.
People have pointed out short-comings in the PHP mb_* functions. I never took time to investigate iconv, but if it works better than mb_*functions, let me know.
The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.
I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
If it's a file you won't save it UTF-8 encoded into the database, but in binary form. When you output the file again, use binary output as well, then this is totally transparent.
Your idea is nice that a user can tell the encoding, be he/she can tell anyway after downloading the file, as it's binary.
So I must admit I don't see a specific issue you raise with your question.
It seems that your question is quite answered, but I have an approach that may simplify you case:
I had a similar issue trying to return string data from MySQL, even configuring both database and PHP to return strings formatted to UTF-8. The only way I got the error was actually returning them from the database.
Finally, sailing through the web I found a really easy way to deal with it:
Giving that you can save all those types of string data in your MySQL in different formats and collations, you only need to, right at your php connection file, set the collation to UTF-8, like this:
$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");
Which means that first you save the data in any format or collation and you convert it only at the return to your PHP file.
If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)
enca also came up here: How to find encoding of a file in Unix via script(s)
There are a couple of libraries out there. onnov/detect-encoding looks promising. It claims to do better than mb_detect_encoding
Example usage for converting string in unknown character encoding to UTF-8:
use Onnov\DetectEncoding\EncodingDetector;
$detector->iconvXtoEncoding('Проверяемый текст')
To simply detect encoding:
$encoding = $detector->getEncoding('Проверяемый текст');
You could set up a set of metrics to try to guess which encoding is being used. Again, it is not perfect, but it could catch some of the misses from mb_detect_encoding().
Because the usage of UTF-8 is widespread, you can suppose it being the default, and when not, try to guess and convert the encoding. Here is the code:
function make_utf8(string $string)
{
    // Test it and see if it is UTF-8 or not
    $utf8 = \mb_detect_encoding($string, ["UTF-8"], true);
    if ($utf8 !== false) {
        return $string;
    }
    // From now on, it is a safe assumption that $string is NOT UTF-8-encoded
    // The detection strictness (i.e. third parameter) is up to you
    // You may set it to false to return the closest matching encoding
    $encoding = \mb_detect_encoding($string, mb_detect_order(), true);
    if ($encoding === false) {
        throw new \RuntimeException("String encoding cannot be detected");
    }
    return \mb_convert_encoding($string, "UTF-8", $encoding);
}
Simple, safe and fast.
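If detection fails too often on your data, one variation is to skip the guessing and fall back to a single fixed legacy encoding. The sketch below is only an illustration: the function name to_utf8_with_fallback() and the choice of ISO-8859-1 as the fallback are my assumptions, not part of the answer above, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as-is, otherwise treat the bytes
// as one fixed legacy encoding (ISO-8859-1 here is an assumption).
function to_utf8_with_fallback(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    // mb_check_encoding() returns true only if the bytes are valid UTF-8
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Not valid UTF-8: convert from the assumed legacy encoding
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "fiancée" as UTF-8 bytes and as ISO-8859-1 bytes
$fromUtf8   = to_utf8_with_fallback("fianc\xC3\xA9e");
$fromLatin1 = to_utf8_with_fallback("fianc\xE9e");
```

Both calls should end up with the same UTF-8 bytes. The trade-off is that text in any other legacy encoding gets converted wrongly without an error, so this only makes sense if you know where most of your non-UTF-8 input comes from.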
If the text is retrieved from a MySQL database, you may try adding this after the database connection.
mysqli_set_charset($con, "utf8");
See mysqli::set_charset in the PHP manual.
I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() to skip any character it can't handle instead of failing (without it, iconv() emits a notice and returns false), and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
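Put together, the two steps could look roughly like the sketch below. The function name slugify() is made up for illustration, and I've written the flags in the order I've seen most often ('ASCII//TRANSLIT//IGNORE'). What TRANSLIT actually produces for a given character depends on the iconv implementation and the locale, so treat the transliteration result as platform-dependent.

```php
<?php
// Hypothetical helper: transliterate to ASCII, then keep only [a-z0-9]
// and join the remaining words with underscores.
function slugify(string $text): string
{
    // @ hides the notice iconv() emits when it cannot convert the input
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    if ($ascii === false) {
        // Fallback (an assumption): simply drop every non-ASCII byte
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $text);
    }
    $ascii = strtolower($ascii);
    // Collapse every run of other characters into a single underscore
    $slug = preg_replace('/[^a-z0-9]+/', '_', $ascii);
    return trim($slug, '_');
}

$slug = slugify('Hello, World!');
```

Because everything outside a-z and 0-9 is replaced, the result needs no further URL encoding.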
As for the database stuff, you really should have set all this up before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry: remember that the database will return byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should encode it as UTF-8 and then %-encode the resulting bytes:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (though not always IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
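A quick way to see the difference between the two functions, as a small sketch. The ú is written as the byte escape "\xC3\xBA" so the example does not depend on the encoding the file is saved in; the variable names are just for illustration.

```php
<?php
// "Medúlla" as UTF-8 bytes
$title = "Med\xC3\xBAlla";

// Path component: rawurlencode() percent-encodes each UTF-8 byte
$url = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);

// The two functions differ in how they treat a space
$inPath  = rawurlencode('a b'); // space becomes %20
$inQuery = urlencode('a b');    // space becomes +
```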
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. In PHP versions before 5.4, unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
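To make the difference concrete, here is a small comparison. The sample string is made up, and the charset is passed explicitly so the result does not depend on the PHP version's default.

```php
<?php
// Sample text with markup characters and one non-ASCII letter (ú as UTF-8 bytes)
$text = "Tom & \"J\xC3\xBAlia\" <b>";

// Escapes only the characters that are special in HTML; ú stays as it is
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Additionally turns ú into the named reference &uacute;
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');
```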
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, since PHP does not interpret \x escapes in single-quoted strings) to avoid the problem.
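For example, the replacement can be written with byte escapes so that it works no matter how the source file is saved. This is only a sketch with made-up variable names:

```php
<?php
// "medúlla" with ú given as its two UTF-8 bytes
$word = "med\xC3\xBAlla";

// Double quotes: PHP turns \xC3\xBA into the two bytes of ú
$plain = str_replace("\xC3\xBA", 'u', $word);

// Single quotes: the same text stays as eight literal characters,
// so it never matches and the string comes back unchanged
$unchanged = str_replace('\xC3\xBA', 'u', $word);
```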
Running SET NAMES utf8 on the database before querying
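Use mysql_set_charset() in preference (or mysqli_set_charset() if you use the mysqli extension; the old mysql extension was removed in PHP 7).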
I'm working on an application which supports several languages and has functionality in place which tries to use the language requested by the browser and also allows a manual override of this choice. This part works fine and picks the correct templates, labels, etc.
Users sometimes have to enter text of their own, and that's where I run into issues, because the application has to accept even "complicated" languages like Chinese and Russian. So far I've taken care of the things mentioned in other posts, i.e.:
calling mb_internal_encoding( 'UTF-8' )
setting the right encoding when rendering the webpages with meta http-equiv=Content-Type content=text/html;charset=UTF-8 (format adapted due to stackoverflow limitations)
even the content arrives correctly, because mb_detect_encoding() == UTF-8
tried to set setLocale(LC_CTYPE, "UTF-8"), which doesn't seem to work because it requires the selection of one language, which I can't specify because I have to support several. And it still fails if I force it manually for testing purposes, e.g. with setLocale(LC_CTYPE,"zh__CN.utf8") - ctype_alpha() would still fail for Chinese text
It seems that even explicit language selection doesn't make ctype_alpha() useful.
Hence the question is: how should I check for alphabetic characters in all languages?
The only idea I had at the moment is to check manually with arrays of "valid" characters - but this seems ugly especially for Chinese.
How would you solve this issue?
If you'd like to check only for valid Unicode letters regardless of the language used, I'd propose using a regular expression (if your PCRE extension is built with Unicode support):
// adjust pattern to your needs
// $input needs to be UTF-8 encoded
if (preg_match('/^\p{L}+$/u', $input)) {
    // OK
} else {
    // not OK
}
\p{L} checks for Unicode characters with the L(etter) property, which includes the properties Ll (lower case letter), Lm (modifier letter), Lo (other letter), Lt (title case letter) and Lu (upper case letter) (from: Regular Expression Details).
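Wrapped in a function, the check might look like this. The name is_unicode_alpha() is my own, and I've swapped ^ and $ for \A and \z so that a trailing newline does not slip through; apart from that it is the same \p{L} idea.

```php
<?php
// Hypothetical helper: true if $input is non-empty, valid UTF-8 and
// consists only of characters with the Unicode "Letter" property.
function is_unicode_alpha(string $input): bool
{
    // /u makes PCRE treat pattern and subject as UTF-8; on invalid UTF-8
    // preg_match() returns false, which also ends up as "not alphabetic"
    return preg_match('/\A\p{L}+\z/u', $input) === 1;
}

$chinese = is_unicode_alpha("\u{4E2D}\u{6587}"); // two Chinese characters
$mixed   = is_unicode_alpha('abc123');           // contains digits
```

One thing to keep in mind: combining marks are in category M, not L, so a letter written as a base character plus a separate combining accent will be rejected unless you widen the pattern (for instance to [\p{L}\p{M}]) or normalize the input first.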
I wouldn't use an array of characters. That would get impossible to manage.
What I'd suggest is working out a 'default' language from the IP address and using that as the locale for a request. You could also get it from the browser-agent string in some cases. You could provide the user a way to override it, so that if your default isn't correct they aren't stuck with a strange site (e.g. provide on the form 'Language set to English. If this isn't correct, please change: '). This isn't the nicest thing to provide, but you won't get any working validation otherwise, as you NEED a language/locale set in order to have sensible alpha validation (an A isn't a letter in Chinese).
You can use the languages from
$_SERVER['HTTP_ACCEPT_LANGUAGE']
It contains something like
de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
so you need to parse this string. Then you can use the preferred language in the setLocale function.
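A minimal sketch of such a parser is below. The function name parse_accept_language() is made up, and the parsing is deliberately simple: it only handles the tag;q=value form shown above and treats a missing q as 1.0.

```php
<?php
// Hypothetical helper: returns the language tags from an Accept-Language
// header, most preferred first.
function parse_accept_language(string $header): array
{
    $entries = [];
    foreach (explode(',', $header) as $position => $part) {
        $pieces = explode(';', trim($part));
        $tag = strtolower(trim($pieces[0]));
        if ($tag === '') {
            continue; // skip empty items
        }
        $quality = 1.0; // default when no q parameter is given
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $quality = (float) $m[1];
            }
        }
        $entries[] = [$tag, $quality, $position];
    }
    // Highest quality first; ties keep their original header order
    usort($entries, function (array $a, array $b): int {
        return ($b[1] <=> $a[1]) ?: ($a[2] <=> $b[2]);
    });
    return array_column($entries, 0);
}

$languages = parse_accept_language('de-de,de;q=0.8,en-us;q=0.5,en;q=0.3');
```

You would then still have to map the first tag you support to a locale name that actually exists on your server before passing it to setLocale.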
This is an encoding issue rather than a language detection issue, because UTF-8 can encode any Unicode character.
The best approach is to use UTF-8 throughout your project: in your database, in your output and as expected encoding for the input.
Output: Make sure you encode your data with UTF-8 and declare that in the HTTP header in the Content-Type field and not just in the document itself.
Input: If you’re using forms, declare the expected encoding in the accept-charset attribute.
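For the output side, sending the header from PHP could look like the sketch below. The helper name utf8_content_type() is made up for illustration; header() itself is the standard PHP function and has to be called before any output is sent.

```php
<?php
// Hypothetical helper: builds the Content-Type header line with a UTF-8 charset
function utf8_content_type(string $mimeType = 'text/html'): string
{
    return 'Content-Type: ' . $mimeType . '; charset=UTF-8';
}

// Only possible while no output has been sent yet
if (!headers_sent()) {
    header(utf8_content_type());
}
```

On the input side the matching declaration is the accept-charset="utf-8" attribute on the form element.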