Is it good practice to use the mb_convert_encoding function? - php

This question is different from "UTF-8 all the way through" because it asks how safe it is, and whether it is good practice, to use the mb_convert_encoding function.
Let's say that users can upload files through a PHP API. Each filename and path gets stored in a PostgreSQL database table whose default encoding is UTF-8.
Sometimes users upload files whose names aren't UTF-8 encoded, and those names get imported into the database. The problem is that the non-UTF-8 characters end up scrambled and do not display as they should in the table columns.
I was thinking of adding the following to the PHP code before import:
if ( ! mb_check_encoding($content, 'UTF-8')) {
    $content = mb_convert_encoding($content, 'UTF-8');
}
Does this look like good practice, and will the result be converted and displayed correctly by the user's client if I return it as UTF-8? Is there a potential loss of bytes when using mb_convert_encoding?
Thanks

If you're going to convert an encoding, you need to know what you're converting from. You can check whether the encoding is or isn't valid UTF-8, but if the check tells you it's not valid UTF-8, then you still have no clue what it is. Omitting the $from_encoding parameter from mb_convert_encoding just makes it assume the internal encoding (mb_internal_encoding()) for that parameter, but that doesn't mean $content actually is in that encoding.
In other words: if you don't know what encoding a string is in, you cannot meaningfully convert it to anything else either, and just trying to convert it from ¯\_(ツ)_/¯ is a crapshoot, with the result equally likely to be something useful or utter garbage.
If you encounter unknown encodings, you only have a few choices:
Reject the input value.
Test whether it's one of a handful of other expected encodings and then explicitly convert from your best guess; but that is pretty much a crapshoot as well.
Just use bin2hex or something similar on the value, essentially giving up on trying to interpret it correctly, but still retaining some semblance of the original value (see the sketch below).
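For illustration, here is a minimal sketch that combines the last two options, assuming Windows-1252 is the most likely legacy encoding among your users (the function name and the candidate encoding are assumptions, not a fixed recipe):
function normalizeFilename(string $name): string
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name; // already valid UTF-8, leave it alone
    }
    // Best guess at a legacy encoding - still a crapshoot, as noted above.
    // (Checking against ISO-8859-1 would be pointless: every possible byte
    // is valid ISO-8859-1, so that check always succeeds.)
    if (mb_check_encoding($name, 'Windows-1252')) {
        return mb_convert_encoding($name, 'UTF-8', 'Windows-1252');
    }
    // Give up on interpretation, but keep a recoverable representation.
    return bin2hex($name);
}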

Related

Invalid UTF-8 sequence handling in PHP

I have a bunch of user-supplied data that I do minimal processing on, such as escaping characters with htmlentities(). Unfortunately, that data may be in one of a few different encodings (yes, it's something that should have been canonicalized to UTF-8 earlier, but by now there are many terabytes of data and it's hard to remediate).
Recently I was rather surprised when certain documents refused to display even though the data was definitely there, with no logged errors or exceptions. After some debugging, it looks like this (from phpsh):
php> var_dump(htmlentities("Hello\xbdWorld", ENT_COMPAT, 'UTF-8'));
string(0) ""
php> var_dump(error_get_last());
NULL
I am aware that the problem here is that the data is actually ISO-8859-1 encoded and that I told htmlentities() to treat it as UTF-8 (I'm working on converting everything to UTF-8, but that will take a very long time). My problem is just that the error handling is so bizarre (non-existent). Tracking down these issues becomes nightmarish. Is there a way built into PHP (e.g., a configuration variable or something) to make this do something less surprising than returning an empty string in an error state?
If not, I'm thinking of redefining the offending function(s) using override_function() or something, to call the original function, check that the return value makes sense, and throw an exception if it doesn't. I found a list of dangerous functions on this very helpful page.
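For reference, PHP 5.4 added the ENT_SUBSTITUTE flag, which addresses exactly this surprise; a minimal sketch:
// ENT_SUBSTITUTE (PHP >= 5.4) replaces invalid code unit sequences with
// U+FFFD instead of making htmlentities() return an empty string.
var_dump(htmlentities("Hello\xbdWorld", ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8'));
// string(13) "Hello�World" - the replacement character takes 3 bytes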
Converting your ISO-8859-1 data to UTF-8 is actually not something that will take a long time. You can automate the process in PHP by looping over your data and applying utf8_encode(). That function may also be very useful for addressing your current issue of displaying ISO data in a UTF-8 document.
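A minimal sketch of that loop, assuming a PDO connection; $pdo, the documents table, and the body column are hypothetical stand-ins for your own schema. (Note that utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') is the equivalent.)
$select = $pdo->query('SELECT id, body FROM documents');
$update = $pdo->prepare('UPDATE documents SET body = ? WHERE id = ?');
foreach ($select as $row) {
    // utf8_encode() converts ISO-8859-1 bytes to UTF-8.
    $update->execute([utf8_encode($row['body']), $row['id']]);
}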

PHP for Python Programmers: UTF-8 Issues

I have an open source PHP website and I intend to modify/translate (mostly constant strings) it so it can be used by Japanese users.
The original code is PHP+MySQL+Apache and written in English with charset=utf-8
I want to change, for example, the word "login" into its Japanese counterpart "ログイン", etc.
I am not sure whether I have to save the PHP source in UTF-8 format (just like in Python)?
I only have experience with Python, so what other issues should I take care of?
If it's in the file, then yes, you will need to save the file as UTF-8.
If it's in the database, you do not need to save the PHP file as UTF-8.
In PHP, strings are basically just binary blobs. You will need to save the file as UTF-8 so the correct bytes are read in. In theory, if you saved the raw bytes in an ANSI file, it would still be output to the browser correctly, just your editor would not display it correctly, and you would run the risk of your editor manipulating it incorrectly.
Also, when handling non-ASCII strings, you'll need to be careful to use the multi-byte versions of the string manipulation functions (substr, for example, can cut a multi-byte character in half; use mb_substr instead).
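A small illustration of the difference, assuming the mbstring extension is available:
$s = "ログイン"; // 4 characters, 12 bytes in UTF-8
echo strlen($s);                   // 12 - counts bytes
echo mb_strlen($s, 'UTF-8');       // 4  - counts characters
echo substr($s, 0, 2);             // 2 raw bytes: a broken, invalid character
echo mb_substr($s, 0, 2, 'UTF-8'); // "ログ"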
If the file contains UTF-8 characters, then save it as UTF-8. Otherwise you can save it in any format. One thing you should be aware of is that the PHP interpreter does not support the UTF-8 byte order mark, so make sure you save the file without one.
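If an editor has already added a BOM, a minimal sketch for stripping it (the filename is a placeholder):
$bytes = file_get_contents('page.php'); // hypothetical file
if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
    file_put_contents('page.php', substr($bytes, 3)); // drop the 3-byte BOM
}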
I'm sorry you have to use PHP after using Python.
PHP has no concept of character sets: all strings are binary, even in parsed PHP code, so if you include a UTF-8 multibyte character in a PHP string, make sure the bytes in the source file are UTF-8 bytes.
You will need to be extremely careful with the use of string functions at all levels of your application. You also need to make sure your MySQL connection is set to use UTF-8 (using SET NAMES, or the charset DSN parameter in later versions of PDO), and that your MySQL string columns use a UTF-8 character set.
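A minimal sketch of both connection options; the host, database name, and credentials are placeholders:
// The charset DSN parameter is honored from PHP 5.3.6 onward.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'password');
// On older versions, set the connection charset explicitly instead:
// $pdo->exec("SET NAMES utf8mb4");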

Global utf8_encode

Is there a way to apply utf8_encode globally? I'm trying to get the entire page to load different languages properly without having to insert utf8_encode before every variable.
echo '<p>'.utf8_encode($LANG_CALL_TO_ACTION).'</p>';
The fact that you have to use utf8_encode this frequently is probably a symptom of an architectural problem.
utf8_encode is a function that converts iso-8859-1 encoded data to UTF-8.
In a modern setup, you won't need to use it at all: The incoming data will already be UTF-8 encoded.
If it is not - say, the data comes from a legacy ISO-8859-1 database (which is totally okay) - you should use the appropriate output encoding instead of UTF-8 (in this case, ISO-8859-1).
It is also possible to globally convert the data if really, really necessary, but to give any advice on that, we'd need to know much more about your setup. It is also probably a bad idea.
You might not have to. utf8_encode() is for converting an ISO-8859-1 string into a UTF-8 encoded one - it does you no good if your data is already UTF-8 and you're just having display issues.
What it seems like you're after is making sure all of your content is displayed correctly as UTF-8, for which you just need to set the proper HTTP header.
My preferred method of doing so is this:
ini_set( 'default_charset', 'UTF-8' );
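This makes PHP advertise UTF-8 in the Content-Type header it sends. The explicit equivalent, which must run before any output, would be:
header('Content-Type: text/html; charset=UTF-8');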

___ encoding to UTF-8 - is there an end-all solution?

I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get text in an unknown character set, and it has strange characters (like curly English quotes), is there a standard way to convert it to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that works, other than utf8_encode, which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason you saw so many complicated solutions to this problem is that, by definition, it is not solvable: the mapping from a byte stream back to text and encoding is not unique.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) that are sent to the server. This is a valid EBCDIC string that represents the letter C and d. This is also a valid ANSI string that represents the 2 characters à and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because character encoding is non-deterministic, you cannot expect that there exist a trivial method to uncover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
function forceToUtf8($string) {
    // mb_detect_encoding() returns false when it cannot identify the
    // encoding; bail out rather than passing false as a source encoding.
    $encoding = mb_detect_encoding($string);
    if ($encoding === false) {
        return false;
    }
    return mb_convert_encoding($string, 'UTF-8', $encoding);
}
If I'm not wrong, there is something called utf8_encode... it works well EXCEPT when the input is already UTF-8 (it assumes ISO-8859-1 input).
http://php.net/manual/en/function.utf8-encode.php

Do I need to make sure output data is valid UTF-8?

I have a website that declares its output to be UTF-8, but I never make sure that it actually is. Should I use a regular expression or the iconv library to convert UTF-8 to UTF-8 (dropping invalid sequences)? Is it a security issue if I do not do this?
First of all, I would never blindly encode it as UTF-8 (possibly a second time), because this would lead to invalid characters, as you say. I would certainly try to detect whether the charset of the content is not UTF-8 before attempting such a thing.
Secondly, if the content in question comes from a source which you have control over and whose charset you control - such as a file stored as UTF-8, or a database using UTF-8 in the tables and on the connection - I would trust that source unless something hints that I can't and there is something funky going on. If the content is coming from more or less random places outside your control, well, all the more reason to inspect it and possibly try to re-encode or transform from other charsets, if you can detect them. So the bottom line is: it depends.
As to whether this is a security issue or not, I wouldn't think so (at least I can't think of any scenarios where it could be exploitable), but I'll leave it to others to be definitive about that.
Not a security issue, but your users (especially non-English speakers) will be very annoyed if you send invalid UTF-8 byte streams.
In the best case (what most browsers do) all invalid strings just disappear or show up as gibberish. The worst case is that the browser quits interpreting your page and says something like "invalid encoding". That is what, e.g., some text editors (namely gedit) on Linux do.
OK, to keep it realistic: if you have an English-centric website that doesn't rely heavily on mathematical characters or Unicode arrows, it will make almost no difference. But if you serve, e.g., a Chinese site, you can totally screw it up.
Everybody gets charsets messed up, so generally you can't trust any outside source. It's good practice to verify that the provided input is indeed valid for the charset it claims to use. Luckily, with UTF-8, you can make a fairly safe assertion about validity.
If it's possible for users to send in arbitrary bytes, then yes, there are security implications of not ensuring valid utf8 output. Depending on how you're storing data, though, there are also security implications of not ensuring valid utf8 data on input (e.g., it's possible to create a variant of this SQL injection attack that works with utf8 input if the utf8 is allowed to be invalid utf8), so you really should be using iconv to convert utf8 to utf8 on input, and just avoid the whole issue of validating utf8 on output.
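A minimal sketch of scrubbing on input, assuming the iconv and/or mbstring extensions are available:
// iconv: //IGNORE silently drops byte sequences that aren't valid UTF-8.
$clean = iconv('UTF-8', 'UTF-8//IGNORE', $raw);

// mbstring alternative: replace invalid sequences instead of dropping them.
mb_substitute_character(0xFFFD); // U+FFFD REPLACEMENT CHARACTER
$clean = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');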
There are two main security reasons to check that the output is valid UTF-8: to avoid "overlong" byte sequences - that is, byte sequences that mean some character like '<' but are encoded in multiple bytes - and to avoid invalid byte sequences. The overlong encoding issue is obvious: if your filter changes '<' into '&lt;', it might not convert a sequence that means '<' but is written differently. Note that all current-generation browsers will treat overlong sequences as invalid, but some people may be using old browsers.
The issue with invalid sequences is that some utf-8 parsers will allow an invalid sequence to eat some number of valid bytes that follow the invalid ones. Again, not an issue if everyone always has a current browser, but...
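For illustration, a minimal sketch of an overlong sequence - the forbidden two-byte form of '<' (0x3C):
$overlong = "\xC0\xBC"; // overlong encoding of '<'; invalid UTF-8
var_dump(mb_check_encoding($overlong, 'UTF-8')); // bool(false)
// A strict re-encode replaces the invalid bytes with the substitute
// character instead of letting them through:
var_dump(mb_convert_encoding($overlong, 'UTF-8', 'UTF-8'));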
