I am trying to convert my existing PHP webpage to use UTF-8 encoding.
To do so, I have done the following things:
specified UTF-8 as the charset in the meta content tag at the start of my webpage.
changed default_charset to UTF-8 in php.ini.
specified UTF-8 as the iconv encoding in the php.ini file.
specified UTF-8 in my .htaccess file using: AddDefaultCharset UTF-8.
Yet after all that, when I echo mb_internal_encoding(), it shows ISO-8859-1. What am I missing here? I know I could use auto_prepend to attach a script that changes the default encoding to UTF-8, but I'm just trying to understand what I'm missing.
Thanks
mb_internal_encoding() doesn't affect the output of your scripts per se; it affects the default encoding used by the multibyte string functions and the conversion of POST and GET inputs.
Simply set with
mbstring.internal_encoding='UTF-8'
in your php.ini file, or programmatically with:
mb_internal_encoding('UTF-8');
Speaking of the mb_ functions, you'll need to rewrite your scripts to use these, e.g. mb_strlen() instead of strlen(), etc.
Also check what HTTP Content-Type headers are being output, though from what you've done it should be OK.
If you are using a database, you'll also have to convert that, and specify that you're using UTF-8 when connecting to it.
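To make the difference between strlen() and mb_strlen() concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escapes); the sample word is only an illustration.

```php
<?php
// Assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

// "Σχετικά" written with escapes so the example does not depend on how this file is saved.
$word = "\u{03A3}\u{03C7}\u{03B5}\u{03C4}\u{03B9}\u{03BA}\u{03AC}";

$bytes = strlen($word);    // counts bytes
$chars = mb_strlen($word); // counts characters, using the internal encoding set above

echo mb_internal_encoding(), "\n";
echo $bytes, ' bytes, ', $chars, " characters\n";
```

Each of the seven Greek letters takes two bytes in UTF-8, which is why the two counts differ.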
The documentation states that you can SET that variable using
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
which should get rid of your problem :)
Related
I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML, I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
&Sigma;&chi;&epsilon;&tau;&iota;&kappa;ά_&mu;&epsilon;_&mu;&alpha;&sigmaf;
but the 'ά' is left unconverted. I need it converted to
&#940;
I'd really appreciate help. I need a universal solution, not a "string replace".
I have been playing around a little, and the following worked. Use mb_convert_encoding() instead of htmlentities():
mb_convert_encoding(urldecode($navstring),'HTML-ENTITIES','UTF-8');
//string(90) "domain.tld/&Sigma;&chi;&epsilon;&tau;&iota;&kappa;&#940;_&mu;&epsilon;_&mu;&alpha;&sigmaf;"
See mb-convert-encoding
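Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. A possible alternative is mb_encode_numericentity(), sketched below. It assumes mbstring is loaded and the input is UTF-8; the helper name is made up for this example. Unlike the call above, it emits numeric entities for every non-ASCII character rather than named ones.

```php
<?php
// Hypothetical helper name; assumes mbstring is loaded and the input is UTF-8.
function utf8_to_numeric_entities(string $text): string
{
    // Convert every code point from U+0080 upward; offset 0, mask keeps all bits.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($text, $convmap, 'UTF-8');
}

$navstring = '%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82';
echo utf8_to_numeric_entities(urldecode($navstring)), "\n";
```

ASCII characters such as & and < pass through untouched, so if the text can contain markup you would probably run it through htmlspecialchars() first.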
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be :
Always know the character encoding of the data you are using.
Store your data with UTF-8.
Output data with UTF-8
The mbstring php extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() and mb_convert_encoding() functions to convert strings from one character encoding to another.
PHP needs to know!
You also need to tell PHP that you are working with UTF-8. To set the default value, you can do it in your php.ini file:
default_charset = "UTF-8";
That default value is added to the default Content-Type header returned by PHP unless you specified it with the header() function :
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP such as :
htmlentities()
htmlspecialchars()
all the mbstring functions
...
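As a rough sketch of the points above: the same default can be set at runtime, and an explicit header can be sent from the script. The helper name send_content_type() is hypothetical; the fallback to default_charset when no encoding argument is given assumes PHP 5.6 or later.

```php
<?php
// Set the default at runtime instead of in php.ini (same setting name as above).
ini_set('default_charset', 'UTF-8');

// Hypothetical helper: sends an explicit Content-Type header if that is still possible.
function send_content_type(string $mime): string
{
    $value = $mime . '; charset=utf-8';
    if (!headers_sent()) {
        header('Content-Type: ' . $value);
    }
    return $value;
}

send_content_type('text/html');

// With no encoding argument, htmlspecialchars() falls back to default_charset.
echo htmlspecialchars("<\u{03AC}>"), "\n";
```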
What is the best way to convert user input to UTF-8?
I have a simple form where a user will pass in HTML, the HTML can be in any language and it can be in any character encoding format.
My question is:
Is it possible to represent everything as UTF-8?
What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
I am trying to work out how to best implement this - advice and links appreciated.
I am making use of Codeigniter and its input class to retrieve post data.
A few points I should make:
I need to convert HTML special characters to their respective entities
It might be a good idea to accept encoding and return it in that same encoding. However, my web app is making use of :
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.
Is it possible to represent everything as UTF-8?
Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
What can I use to effectively convert any character encoding to UTF-8
iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.
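For the case where you do know the source encoding, a conversion could look like this sketch. It assumes the iconv extension is available; convert_to_utf8() is a made-up helper name and ISO-8859-1 is just the example encoding.

```php
<?php
// Hypothetical helper; assumes the iconv extension is available.
function convert_to_utf8(string $bytes, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new InvalidArgumentException("Could not convert from $sourceEncoding");
    }
    return $converted;
}

// "é" is the single byte 0xE9 in ISO-8859-1 and the two bytes 0xC3 0xA9 in UTF-8.
$latin1 = "\xE9";
echo bin2hex(convert_to_utf8($latin1, 'ISO-8859-1')), "\n"; // c3a9
```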
so that I can parse it with PHP string functions
"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.
save it to my database
Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.
subsequently echo out using htmlentities?
Just echo htmlentities($text), nothing more you need to do.
However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.
1) Is it possible to represent everything as UTF-8?
Yes, everything defined in Unicode. That's the most you can get nowadays, and Unicode has room to support more in the future.
2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
The only thing you need to know is the actual encoding of your data. If you want your web application to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your application's user interface.
Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.
Related:
Preparing PHP application to use with UTF-8
How to detect malformed utf-8 string in PHP?
If you want to change the encoding of a string you can try
$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );
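Called with two arguments like that, mb_convert_encoding() takes the source encoding from the mbstring internal encoding. A slightly more defensive sketch checks whether the input is already valid UTF-8 and only converts otherwise. The helper name and the ISO-8859-1 fallback are assumptions for illustration; the fallback is a guess about your data, not a detection result.

```php
<?php
// Hypothetical helper; the ISO-8859-1 fallback is an assumption, not a detection result.
function ensure_utf8(string $input, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', $fallbackEncoding);
}
```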
I found out that the only thing that works out for UTF-8 encoding is setting inside my config.php
putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8'); // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");
EDIT :
Is it possible to represent everything as UTF-8?
Yes, this is what you need to ensure:
html : headers/meta-header set to utf-8
all files saved as utf-8
database collation, tables and data encoding to utf-8
What can I use to effectively convert any character encoding to UTF-8
You can use utf8_encode() before saving it into your database. It converts from ISO-8859-1, which is generally what the input will be (or a close relation of it) on a system set up mainly for Western European languages (ref).
// eg
$name = utf8_encode($this->input->post('name'));
And as I mentioned before, you need to make sure the database collation, tables and data encoding are set to UTF-8. In CI, in your database connection config:
// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
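One caveat worth adding: MySQL's utf8 charset stores at most three bytes per character, so characters outside the Basic Multilingual Plane (emoji, for example) do not fit. If your MySQL is 5.5.3 or later, the utf8mb4 variant may be the better choice. This is a sketch of the same config with that assumption; check that your CodeIgniter version and driver accept it.

```php
<?php
// Same CodeIgniter settings as above, assuming MySQL 5.5.3+ where utf8mb4 is available.
$db['default']['char_set'] = 'utf8mb4';
$db['default']['dbcollat'] = 'utf8mb4_unicode_ci';
```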
UTF-8 is the de facto standard for web applications now, but it is not the default encoding for PHP (until 6.0). Most servers are set up for ISO-8859-1 by default.
How can I override the default settings in .htaccess to be sure that everything works well for UTF-8, locale etc.? Are there any options for the web server or the Unix OS?
Is there a comprehensive list of those settings, e.g. mbstring options, iconv settings, locale etc. that I should set up for each multi-language project? Is there a predefined .htaccess as an example?
(In my particular case I need setup for the languages: English, Dutch and Russian. The server is in Ukraine).
Some useful options to have in .htaccess:
########################################
# Locale settings
########################################
# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"
SetEnv LC_ALL nl_NL.UTF-8
########################################
# Set up UTF-8 encoding
########################################
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
php_value default_charset "UTF-8"
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"
php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
php_value mbstring.encoding_translation On
php_value mbstring.func_overload 6
# See also php functions:
# mysql_set_charset
# mysql_client_encoding
# database settings
#CREATE DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#
#ALTER DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#ALTER TABLE tbl_name
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# ;
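To check which of these settings actually took effect for a request, a small diagnostic script can help. This is only a sketch; encoding_report() is a hypothetical helper, and the mbstring calls are guarded in case the extension is not loaded.

```php
<?php
// Hypothetical diagnostic helper: collects the encoding-related settings in effect.
function encoding_report(): array
{
    return [
        'default_charset'      => ini_get('default_charset'),
        'mb_internal_encoding' => function_exists('mb_internal_encoding') ? mb_internal_encoding() : null,
        'mb_http_output'       => function_exists('mb_http_output') ? mb_http_output() : null,
        'locale'               => setlocale(LC_ALL, '0'), // "0" only queries the current locale
        'timezone'             => date_default_timezone_get(),
    ];
}

foreach (encoding_report() as $name => $value) {
    echo $name, ' = ', var_export($value, true), "\n";
}
```

Run it through the web server rather than the CLI, since php_value lines in .htaccess only apply to requests served by Apache.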
You're right, UTF-8 is a good choice for web applications.
Encoding is meta-information about the data that gets processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost if you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.
As a rule of thumb, PHP is binary; it's the context (you) that specifies the encoding (e.g. how you save your PHP source-code files).
So let's tackle a short (and incomplete) list:
The OS
Environment variables might tell you about the locale in use and the encoding. File systems have their own encoding for the names of files and directories, for example. I'm not very firm on this subject; normally we try to name our files in English so as to use only characters in the US-ASCII range, which is safe for Latin extended charsets like ISO-8859-1 in your case as well as for UTF-8.
Just keep this in mind when you save files your users upload: filter filenames to basic letters and punctuation and you'll have nearly no hassles (a-z, A-Z, 0-9, ., -, _); you can even make them all lowercase for visual purposes.
If you feel that this degrades usability and the file system does not offer the Unicode range of characters available in UTF-8, you can fall back to simple encodings like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name to the file on disk.
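Both approaches for upload names can be sketched roughly as follows. The helper names and the "file" fallback are illustrative choices, not part of any library.

```php
<?php
// Hypothetical helpers; names and the "file" fallback are illustrative choices.

// Option 1: reduce an uploaded name to lowercase a-z, 0-9, ".", "-" and "_".
function sanitize_filename(string $name): string
{
    $name = strtolower(basename($name));                 // drop any directory part
    $name = preg_replace('/[^a-z0-9._-]+/', '_', $name); // replace runs of other bytes
    $name = trim($name, '._-');                          // no leading/trailing punctuation
    return $name === '' ? 'file' : $name;
}

// Option 2: keep the original name recoverable by percent-encoding it.
function encode_filename(string $name): string
{
    return rawurlencode(basename($name));
}

echo sanitize_filename('My Photo (1).JPG'), "\n"; // my_photo_1_.jpg
```

With the second option, rawurldecode() gives back the original name when you offer the file for download.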
Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
HTML
This is largely independent of PHP; it concerns the output your scripts produce.
Handling character encodings in HTML and CSS
Rule of thumb is: Specify it. If you didn't specify it (HTML files, CSS files, JavaScript files), don't expect it to work precisely. Just do it then. Encoding is a chain: if there are many components, ensure that each knows about its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care and make this precise and well defined.
PHP Settings
As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to my mind:
default_charset - PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty (Source). For general information see Setting the HTTP charset parameter (W3C). If you want to improve your site's output, e.g. for preserving the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well: <meta http-equiv="Content-type" content="text/html;charset=UTF-8">.
output_handler - This setting is worth a look, as it specifies the output handler (Output Buffering Control, docs) and each handler (mb, iconv) can have its own encoding settings (see Strings).
Strings
Strings (docs) - By default, strings in PHP are binary. As long as you use them with binary-safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings. That was meant for forward compatibility with the planned PHP 6 Unicode support: $binary = (binary) $string; or $binary = b"binary string";.
mb_internal_encoding() (docs) - Get or set it; the INI setting is mbstring.internal_encoding. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
iconv_set_encoding() (docs) - The comparable function for the iconv extension. See also the iconv configuration settings.
Various: Some functions that deal with character sequences allow you to specify a charset encoding, for example htmlspecialchars (docs). Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 but you're looking for UTF-8. Other functions like html_entity_decode (docs) use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
To answer your question: The settings and parameters you need always depend on the components you use. For the general ones like the browser or the web server, it's possible to recommend settings to get them configured for UTF-8. But with everything else it depends. The most important thing is to look for it and to ensure that you know the encoding and can configure or specify it. Often it's documented. As long as you don't need to write portable code, this is much simpler, as you have control of the environment or only need to deal with a specific environment. Write code defensively with encoding in mind and you should be fine.
All your files have to be saved in UTF-8 (without BOM) using your code editor.
The web server may be configured to send inappropriate headers, so it's recommended to override them at the application level. For instance:
header('Content-Type: text/html; charset=utf-8');
Add HTML meta content-type:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Use htmlspecialchars() instead of htmlentities() because the former is enough in utf-8 and the latter is incompatible with utf-8 by default.
Tend not to use PHP standard string functions because many of them are incompatible with utf-8. Try to find their counterparts in Multibyte String or other libraries. (Don't forget to set default charset for the library before using it because the library supports many encodings and utf-8 is just one of them.)
For regular expressions use u modifier. For example:
preg_match('/ž{3,5}/u', $string, $matches);
Together, this is the most reliable way to check whether a given string is valid UTF-8:
if (@preg_match('//u', $string) === false) {
// NOT valid!
} else {
// Valid!
}
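To see what the u modifier changes, here is a small sketch. It assumes PHP 7 or later for the "\u{...}" escape; the variable names are only for illustration.

```php
<?php
// "ž" (U+017E) is two bytes in UTF-8, written with an escape so the file encoding does not matter.
$z = "\u{017E}";

// Without /u the dot matches a single byte, so a two-byte character does not fit.
$withoutU = preg_match('/^.$/', $z);  // 0
// With /u the dot matches one whole character.
$withU = preg_match('/^.$/u', $z);    // 1

// A lone continuation byte is not valid UTF-8; with /u, preg_match() returns false.
$invalid = @preg_match('//u', "\x80");
$badUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;
```

Checking preg_last_error() lets you tell a failed match (0) apart from invalid input (false).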
If you use the database then always set appropriate connection encoding right after the connection is made. Example for MySQL:
mysql_set_charset('utf8', $link);
Also check that the columns in the database are in UTF-8. It's not always needed, but recommended.
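The mysql_* functions were removed in PHP 7, so on current versions the connection encoding has to be set another way. One option is the charset parameter of the PDO MySQL DSN, sketched here. The helper names, host, database name and credentials are all placeholders, and utf8mb4 assumes MySQL 5.5.3 or later.

```php
<?php
// Hypothetical helper; host and database name are placeholders.
function utf8_mysql_dsn(string $host, string $dbname): string
{
    // The charset parameter makes the connection itself use utf8mb4.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Hypothetical helper; not called here because it needs a running MySQL server.
function connect_utf8(string $host, string $dbname, string $user, string $password): PDO
{
    return new PDO(utf8_mysql_dsn($host, $dbname), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

With mysqli, the equivalent would be calling set_charset('utf8mb4') on the connection object right after connecting.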
Basically I do three things to work correctly with czech language:
1) define locale in PHP:
setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");
so you would use something like:
setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");
based on language which is currently switched to.
2) define charset for the database:
mysql_query("set names latin2 collate latin2_czech_cs");
3) define the charset of PHP/HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
I don't use any .htaccess settings. You can modify this for your case: for the locale use something like en_US.utf8 (based on the language currently switched to), for the charset use utf-8 instead of latin2/iso-8859-2, and it should work well.
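Locale names differ between systems (nl_NL.UTF-8 on one, nl_NL.utf8 on another), so it can help to pass several candidates. A minimal sketch, with a made-up helper name and an example candidate list:

```php
<?php
// Hypothetical helper: tries several spellings of a locale name and reports which one applied.
function set_first_available_locale(array $candidates)
{
    // setlocale() accepts an array and uses the first locale the system knows;
    // it returns the applied locale name, or false if none is available.
    return setlocale(LC_ALL, $candidates);
}

$applied = set_first_available_locale(['nl_NL.UTF-8', 'nl_NL.utf8', 'C']);
echo $applied === false ? 'no locale available' : $applied, "\n";
```

Checking the return value matters, because setlocale() fails quietly when the requested locale is not installed.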
Try one of the following:
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
When I create file on Windows hosting, it gets name like джулия.jpg
It has to be a cyrillic name.
fopen() is used for creation.
What can I do with this?
It's an encoding issue.
Setting PHP to use UTF-8 encoding will probably suffice: http://php.net/manual/en/function.utf8-encode.php
UTF-8 can represent every character in the Unicode character set, plus it has the special property of being backwards-compatible with ASCII.
Check if all the script files use the same encoding (ANSI, ISO-..., UTF-8, etc).
Check the internal encoding your script uses and the encoding of the string:
multibyte functions
internal encoding of your script
encoding of your string
NB: Not recommending you to use string input from websites in the filesystem!
But if you expect input in a certain format, be sure to specify the content type of your html page.
Currently in my application the utf8 encoded data is spoiled by internal coding of PHP.
How to make it consistent with utf8?
EDIT: To show examples, please tell me how to output the current internal encoding in PHP?
In php.ini I found the following:
default_charset = "iso-8859-1"
Which means Latin1.
How do I change it to utf8? Say, what's the ISO version of utf8?
Change it to:
default_charset = "utf-8"
There is no ISO version of UTF-8.
You'll need to be specific with the details since encoding can be mangled at many different areas in your PHP application.
The common problem areas are:
Saving and retrieving from DB:
The database encoding must be the same as that of the strings sent to it from PHP, or you must convert the strings to the DB encoding.
PHP4's single byte string functions:
PHP's functions such as strlen(), str_replace() do not produce the correct results on multibyte encodings such as UTF-8, since they operate on single bytes.
Page encoding:
Make sure the browser knows you are sending it UTF-8.
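The single-byte string function problem from the list above can be shown with a short sketch. It assumes mbstring is loaded and PHP 7 or later for the "\u{...}" escapes; the sample text is only an illustration.

```php
<?php
// "Привет" written with escapes so the example does not depend on the file encoding.
$text = "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}";

$byteCut = substr($text, 0, 3);             // three bytes: cuts the second character in half
$charCut = mb_substr($text, 0, 3, 'UTF-8'); // first three characters

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The byte-based cut leaves a broken sequence at the end, which usually shows up as a replacement character in the browser.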
You can also tell PHP functions which encoding your strings use. For example, pass the charset to htmlentities() when escaping output:
$new_value = htmlentities($old_value, ENT_COMPAT, "UTF-8");
and also you can add the following in the html head section
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I hope this will help to solve your problem.