How to handle character encoding in PHP - Codeigniter?

How to handle character encoding in PHP - Codeigniter? - php

What is the best way to convert user input to UTF-8?
I have a simple form where a user will pass in HTML, the HTML can be in any language and it can be in any character encoding format.
My question is:
Is it possible to represent everything as UTF-8?
What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
I am trying to work out how to best implement this - advice and links appreciated.
I am making use of Codeigniter and its input class to retrieve post data.
A few points I should make:
I need to convert HTML special characters to their respective entities
It might be a good idea to accept encoding and return it in that same encoding. However, my web app is making use of :
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.

Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.

Is it possible to represent everything as UTF-8?
Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
What can I use to effectively convert any character encoding to UTF-8
iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.
so that I can parse it with PHP string functions
"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.
save it to my database
Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.
subsequently echo out using htmlentities?
Just echo htmlentities($text), nothing more you need to do.
However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.

1) Is it possible to represent everything as UTF-8?
Yes, everything defined in UNICODE. That's the most you can get nowadays, and there is room for the future that UNICODE can support.
2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
The only thing you need to know is the actual encoding of your data. If you want your webapplication to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your applications user-interface.
Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.
Related:
Preparing PHP application to use with UTF-8
How to detect malformed utf-8 string in PHP?

If you want to change the encoding of a string you can try
$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );

I found out that the only thing that works out for UTF-8 encoding is setting inside my config.php
putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8'); // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");

EDIT :
Is it possible to represent everything as UTF-8?
Yes, these is what you need to ensure :
html : headers/meta-header set to utf-8
all files saved as utf-8
database collation, tables and data encoding to utf-8
What can I use to effectively convert any character encoding to UTF-8
You can use utf8_encode (Since for a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation,ref) before saving it into your database.
// eg
$name = utf8_encode($this->input->post('name'));
And as i mention before, you need to make sure database collation, tables and data encoding to utf-8. In CI, at your database connection config
// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';

Related

Removing unicode bullet character

I'm having an issue that i believe is related to unicode text. When the user enters a string that has the unicode bullet character, mysql is not able to save that field (the rest of the update query works though). Here's how i've been trying to deal with it.
$str = "· Close up the server";
$str = preg_replace("\u2022", "•", $str);
...however this is still not working.

So many things can go wrong here, because database, form submits and source code string literals are all involved. I'll assume you want to use UTF-8, because with any other typical encoding (CP1252, Latin1) you'll be screwed when you want to use json_ or accept more than ~200 different characters.
The first thing to do is remove any kind of conversion etc code that was written with the intention of trying to fix encoding issues. Such as utf8_encode, htmlentitites, *_replace.. whatever.
Source encoding.
$str = "· Close up the server";
When writing the above, the PHP source file needs to be physically encoded in UTF-8. If you are on Windows, you must explicitly do or configure this. UTF-8 doesn't happen magically on Windows.
Form submits
When user submits a form, the payload will be in whatever encoding you declared the page to be. You can declare it like so:
header("Content-Type: text/html; charset=utf-8");
But anyone can actually submit arbitrary bytes to your server, so you should validate the input is in UTF-8 before proceeding. mb_check_encoding is good.
Database
Since at this point your data is coming in as UTF-8, your input strings are in UTF-8. You must specify this after connecting to the database, by specifying a connection encoding.
mysql_set_charset("utf8"); //After making the connection, and before any queries
//or $mysqli->set_charset( "utf8");
This makes the database read your input in UTF-8, and encode its output in UTF-8. You would also want to set your columns/tables/databases to UTF-8 as well.
Unicode escape sequences \uxxxx or \uhhhh\ullll or \Uxxxxxxxx are not supported in PHP.

\u2022 is the UTF-16 hex encoding for "Bullet". Not UTF-8.
You might also want to SET NAMES 'UTF-8'; or change charset before you open your database.

How do you know what encoding the user is inputing into the browser?

I read Joel's article about character sets and so I'm taking his advice to use UTF-8 on my web page and in my database. What I can't understand is what to do with user input. As Joel says, "It does not make sense to have a string without knowing what encoding it uses." But how do I know what encoding the user input string uses? If I have
<input type="text" name="atextfield" >
on my page, how do I know what encoding I'm getting from the user? What if the user puts in some special ASCII symbol, like ♣ or ™ or something? Is there some way I can detect that user input gave me something unrecognized in UTF-8? Is there some standard for how to handle this sort of thing?

Check the HTTP headers to discover the character encoding.

If your web-page using UTF-8, browser will convert to UTF-8 for you. So, even the special characters are in ASCII it will submit as UTF-8.
However, you never know itchy hand from an user that switch back the page encoding to ISO-8859-*.
You can make use on mb_detect_encoding, but is not 100% bullet-proof.
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);
/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");
/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");
/* Use array to specify encoding_list */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);

Don't try to detect, convert all user-inputed text to UTF-8 in your application. You can do all you can on your side, by configuring your webserver to send UTF-8 pages and UTF-8 headers, configure your application to handle all text in UTF-8, tweak your filesystem (if necessary) to handle text files as UTF-8, configure your database, but you simply have no real control on the user end. You can suggest the proper character encoding in your html forms, like the following, but it's not really enforceable on the user end:
<form action="/index.php" method="post" accept-charset="UTF-8"></form>
Unless detecting the encoding of the user input is the whole purpose of your application, it's a fools errand to try. Assume the encoding is wrong and convert it to UTF-8 in your app. Just as you should assume your user input is malicious and clean it up before you attempt to insert it into your database.
In most languages that have UTF-8 properly implemented, ASCII characters will survive conversion, so don't worry about that either.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

How to make internal processing encoding change to UTF8 in PHP?

Currently in my application the utf8 encoded data is spoiled by internal coding of PHP.
How to make it consistent with utf8?
EDIT:To show examples,please tell me how to output the current internal encoding in PHP?
In php.ini I found the following:
default_charset = "iso-8859-1"
Which means Latin1.
How to change it to utf8,say,what's the iso version of utf8?

Change it to:
default_charset = "utf-8"
There is no ISO version of UTF-8.
You'll need to be specific with the details since encoding can be mangled at many different areas in your PHP application.
The common problem areas are:
Saving and retrieving from DB:
The database encoding must the same as the strings sent to it from PHP, or you must convert the strings to the DB encoding.
PHP4's single byte string functions:
PHP's functions such as strlen(), str_replace() do not produce the correct results on multibyte encodings such as UTF-8, since they operate on single bytes.
Page encoding:
Make sure the browser knows you are sending it UTF-8.

You can change the character encoding in php file. To change encoding in php page use the following function.
$new_value = htmlentities('$old_value',ENT_COMPAT, "UTF-8");
and also you can add the following in the html head section
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I hope this will help to solve your problem.

PHP character encoding problems

I need help with a character encoding problem that I want to sort once and for all. Here is an example of some content which I pull from a XML feed, insert into my database and then pull out.
As you can not see, a lot of special html characters get corrupted/broken.
How can I once and for all stop this? How am I able to support all types of characters, etc.?
I've tried literally every piece of coding I can find, it sometimes corrects it for most but still others are corrupted.

To absolutely once and for all make sure you will never have problems with encoding again:
Use UTF-8 everywhere and on everything!
That is (if you use mysql and php):
Set all the tables in your database to collation "utf8_general_ci" for example.
Once you establish the database connection, run the following SQL query: "SET NAMES 'utf8'"
Always make sure the settings of your editor are set to UTF-8 encoding.
Have the following meta tag in the section of your HTML documents:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
And couple of bonus tips:
When you use PHP for string manipulation, use the multibyte functions.
You might check http://docs.kohanaphp.com/core/utf8 as well at some point.
OR:
You can just use one simple server side configuration file that takes care of all encoding stuff. In this case you wont need header and/or meta tags at all or php.ini file modification. Just add your wanted character set encoding to .htaccess file and put it into your www root. If you want to fiddle with character set strings and use your php code for that - thats another story. Database collation must ofcourse be correct.
Footnote: UTF-8 is not the encoding solution its an a solution. It doesn't matter what character set/encoding one is using as long as the used environment has been taking to consideration.

My favorite article about encodings from JoelOnSoftware: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

It seems that an UTF-8 encoded text is interpreted with ISO 8859-1.
If you’re processing XML documents, you have to use the encoding given either in the charset parameter in HTTP header field Content-Type or in the encoding attribute in the XML declaration. If none of both is given, the XML specification declares UTF-8 or UTF-16 as the default character encoding and you have to use some detection.

It looks like the link you gave has data that is encoded in utf-8. (Follow that link, then change the encoding of your browser to utf-8).
I sounds like you are having problems with inserting and retrieving from your database. Make sure your database table has utf-8 set as the encoding.

After you connect to the database, but before you do any transactions, execute the following line which makes sure all database communication is in UTF-8:
mysql_query("SET character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'", $dbconn);

First off, make sure your database's character encoding is set to support UTF-8. Secondly, PHP's ICONV is going to be your friend. Finally, ensure that your response headers are sending the proper character encoding (again, UTF-8).

header('Content-type: text/html; charset=UTF-8') ;
/**
* Encodes HTML safely for UTF-8. Use instead of htmlentities.
*
* #param string $var
* #return string
*/
function html_encode($var)
{
return htmlentities($var, ENT_QUOTES, 'UTF-8');
}
Those two rescued me and I think it is now working. I'll come back if I continue to encounter problems. Should I store it in the DB, eg as "&" or as "&"?

Did you try utf8_encode() and utf8_decode()?
Which one you use will depend entirely on how your data is encoded, which you don't specify, but they are quite useful for this kind of cases.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.