I am confused about the behavior of utf8_decode() and just want a little clarification. I hope that's ok.
Here's a simple HTML form that I'm using to capture some text and save it to my MySQL database (which uses the utf8_general_ci collation):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<form action="update.php" method="post" accept-charset="utf-8">
<p>
Title: <input type="text" name="title" id="title" accept-charset="utf-8" size="75" value="" />
</p>
<p>
<input type="submit" name="submit" value="Submit" />
</p>
</form>
</body>
</html>
As you can see I've got this coded up with charset=utf-8 in the appropriate places. We accept text that includes diacritics (e.g., ñ, ó, etc.). In the end, we run a little script on all text input to check for diacritics and change them to HTML entities (e.g., ñ becomes &ntilde;).
When input is received by my script, I first have to do utf8_decode($input) and then run my little script to check for and change diacritics as needed. Everything works fine. I'm curious as to why I have to run the decode on this input. I understand that utf8_decode converts a string encoded in UTF-8 to ISO-8859-1. I want to make sure - even though everything works fine (or so I think) - that I'm not doing something screwy that will catch up to me later. For instance, that I'm sending ISO-8859-1 encoded characters to be stored in my database that is set up to store/serve UTF-8 characters. Should I do something like run utf8_encode() on the string that my diacritics-to-entities script returns? Eg:
$string = utf8_decode($string);
$search = explode(",","À,È,Ì,Ò,Ù,à,è,ì,ò,ù,Á,É,Í,Ó,Ú,Ý,á,é,í,ó,ú,ý,Â,Ê,Î,Ô,Û,â,ê,î,ô,û,Ã,Ñ,Õ,ã,ñ,õ,Ä,Ë,Ï,Ö,Ü,Ÿ,ä,ë,ï,ö,ü,ÿ,Å,å,Æ,æ,ß,Þ,þ,ç,Ç,Œ,œ,Ð,ð,Ø,ø,§,Š,š,µ,¢,£,¥,€,¤,ƒ,¡,¿");
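$replace = explode(",","&Agrave;,&Egrave;,&Igrave;,&Ograve;,&Ugrave;,&agrave;,&egrave;,&igrave;,&ograve;,&ugrave;,&Aacute;,&Eacute;,&Iacute;,&Oacute;,&Uacute;,&Yacute;,&aacute;,&eacute;,&iacute;,&oacute;,&uacute;,&yacute;,&Acirc;,&Ecirc;,&Icirc;,&Ocirc;,&Ucirc;,&acirc;,&ecirc;,&icirc;,&ocirc;,&ucirc;,&Atilde;,&Ntilde;,&Otilde;,&atilde;,&ntilde;,&otilde;,&Auml;,&Euml;,&Iuml;,&Ouml;,&Uuml;,&Yuml;,&auml;,&euml;,&iuml;,&ouml;,&uuml;,&yuml;,&Aring;,&aring;,&AElig;,&aelig;,&szlig;,&THORN;,&thorn;,&ccedil;,&Ccedil;,&OElig;,&oelig;,&ETH;,&eth;,&Oslash;,&oslash;,&sect;,&Scaron;,&scaron;,&micro;,&cent;,&pound;,&yen;,&euro;,&curren;,&fnof;,&iexcl;,&iquest;");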
$new_input = str_replace($search, $replace, $string);
return utf8_encode($new_input); // right now i just return $new_input.
Appreciate any insight anyone has to offer about this.
Do not use "accept-charset". It's broken. Most browsers have stopped sending it in their own http requests. Some browsers (IE) completely ignore this attribute when they parse a form, and others do a very limited job with it. In practice, the "accept-charset" will do more harm than good.
The convention is that the browser will send the data in the same encoding as it received the form. So make sure your page is sent as UTF-8. Your meta-tag in the HTML's head isn't enough. For a PHP page, this setting can be set in 3 places:
An HTML tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the "head".
An AddDefaultCharset UTF-8 line in the Apache configuration (or anything similar in other web servers).
A PHP call to header("Content-Type: text/html; charset=utf-8"); (before anything is displayed on the page).
Each directive overrides the previous ones. So if your server already declares a charset, your meta tag will be ignored.
So you should:
Make sure your source file is in UTF-8, of course.
Fix your HTML source so that it validates at W3C. For instance, your meta tag should be closed in XHTML.
Remove the "accept-charset" attributes.
If necessary, force the encoding declaration in Apache or with PHP's header().
Ensure in your browser that the HTTP headers received from the server have the right encoding declared (or no encoding if you rely on your meta tag). On Linux curl -I <URL> displays the HTTP headers only.
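Forcing the declaration from PHP can be sketched as follows. This is a minimal sketch, not a definitive setup: contentTypeHeader() is a hypothetical helper name, and the headers_sent() guard is only there to illustrate that header() has to run before any output.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent
// (even stray whitespace or a BOM in front of the opening tag counts as output).
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

After deploying something like this, the curl -I check mentioned above should show the charset in the Content-Type response header.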
When a form is submitted as UTF-8 (for example with accept-charset="utf-8"), the browser sends the form data to the server as UTF-8 bytes. utf8_decode() turns that data back into strict ISO-8859-1. For example, if you submit "ñ", UTF-8 encoding sends "%C3%B1" to your form action (ISO-8859-1 would send "%F1"), and it must be converted to the single-byte "ñ" if your script compares it against ISO-8859-1 strings.
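To see what is actually sent on the wire, the two encodings of "ñ" can be compared byte by byte. The sketch below assumes the mbstring extension is available, and writes the character as explicit bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// "ñ" written as explicit bytes, independent of this file's encoding.
$utf8   = "\xC3\xB1";                                        // ñ in UTF-8 (two bytes)
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // ñ in ISO-8859-1 (one byte)

// urlencode() shows the percent-encoded form a browser would put in the request.
echo urlencode($utf8), "\n";   // %C3%B1
echo urlencode($latin1), "\n"; // %F1
```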
The meta tag will get the page to display the text in UTF-8, but even if you switch the form to UTF-8 using accept-charset="utf-8", the server converts the data to ISO-8859-1, and when it's displayed it is converted from ISO-8859-1 to UTF-8 again. A UTF-8-only character can't survive that conversion, so it ends up displayed as a weird character, and every time you loop through this process it gets worse. What I've found is that even though you do everything on the HTML side, there isn't a way to switch the server to read UTF-8, so you can't switch everything to UTF-8. That is on Apache, and if there is a way I'd love to know.
I'm pulling some content from my database and when I display it, I am getting some random characters occasionally dispersed throughout the content. I am seeing a lot of Â where spaces were/are. I'm also getting â€™ in some places.
The characters don't appear when I view in phpMyAdmin. How do I encode the content correctly? Is it something I should do BEFORE I insert the content or is it something I do when I am displaying?
What character set is the data stored in?
For example, if the data is stored as UTF-8, then when displaying the data, you need to make sure the page encoding is set to UTF-8 as well.
If it is stored in some other character set, then set the page encoding as appropriate.
You can do this by passing appropriate headers:
Content-Type: text/html; charset=utf-8
Or letting the browser know in your document:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
And in HTML5:
<meta charset="utf-8" />
That's UTF-8 being misinterpreted as CP1252. Make sure all the appropriate headers are in place.
>>> print u'â€™'.encode('cp1252').decode('utf-8')
’
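The same misinterpretation can be reproduced in PHP, which may help confirm the diagnosis on your own data. This is a sketch that assumes the mbstring extension is available; the sample characters (a right single quotation mark and a no-break space) are guesses at what the original content contained.

```php
<?php
$quote = "\xE2\x80\x99"; // U+2019 RIGHT SINGLE QUOTATION MARK as UTF-8 bytes
$nbsp  = "\xC2\xA0";     // U+00A0 NO-BREAK SPACE as UTF-8 bytes

// Treat the UTF-8 bytes as if they were Windows-1252 and convert "to" UTF-8:
// this is the wrong step that produces the stray characters.
$garbledQuote = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');
$garbledNbsp  = mb_convert_encoding($nbsp, 'UTF-8', 'Windows-1252');

echo $garbledQuote, "\n"; // â€™
echo $garbledNbsp, "\n";  // Â followed by a no-break space

// Undoing the wrong step recovers the original bytes.
$restored = mb_convert_encoding($garbledQuote, 'Windows-1252', 'UTF-8');
var_dump($restored === $quote); // bool(true)
```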
IMO, the best thing would be to work on utf-8 on your files/database (or at least the same encoding in all places).
Please check what do you have under $db['default']['char_set'] and $db['default']['dbcollat'] on your application/config/database.php and what encoding you are using in your views/html. If you see the data correctly on PMA, then maybe the problem is in your views.
Try to use utf8_encode or utf8_decode when you print your text.
I am new here, so I apologize if I am doing anything wrong.
I have a form which submits user input onto another page. User is expected to type ä, ö, é, etc... I have placed all of the following in the document:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
header('Content-Type:text/html; charset=UTF-8');
<form action="whatever.php" accept-charset="UTF-8">
I even tried:
ini_set('default_charset', 'UTF-8');
When the other page loads, I need to check what the user input with something like:
if ( $_POST['field'] == $check ) {
...
}
But if he inputs something like 'München', PHP will compare 'MÃ¼nchen' with 'München' and will never trigger TRUE even though it should. Since UTF-8 is specified everywhere, I am guessing that the server is converting to something else (Windows-1252, as I read on another thread) because it does not support or is not configured for UTF-8. I am using Apache on a local server before I load into production; I have not changed any of the default settings (and don't know how to). I've been working on Windows 7, editing with Notepad++ and encoding my files in ANSI. If I bin2hex('München') I get '4dc3bc6e6368656e'.
If I echo $_POST['field']; it displays 'München' correctly.
I have researched everywhere for an explanation, all I find is that I should include those tags/headings I already have.
Any help is much appreciated.
You are facing several different problems at the same time. Let's start with the simplest one.
Problem 1) You say that echo $_POST['field']; will display it correctly? What do you mean by "display"? It can be displayed correctly in two cases:
either the field is in UTF-8 and your page has been declared as UTF-8 and the browser is displaying it as UTF-8 or,
the field is in Latin-1 and the browser has decided (through the auto-detection heuristics) that your page is in Latin-1.
So, the fact that echo $_POST['field']; is correct tells you nothing.
Problem 2) You are using
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
header('Content-Type:text/html; charset=UTF-8');
Is this PHP code? If it is, it will be an error because the header must be set before sending out any byte. If you do this you will not set the Content-Type header and PHP should generate a warning.
Problem 3) You are using
<form action="whatever.php" accept-charset="UTF-8">
Some browsers (IE, mostly) ignore accept-charset if they can coerce the data to be sent in ASCII or ISO Latin-1. So the data will be in UTF-8 and declared as ISO Latin-1 or ISO Latin-1 and sent as ISO Latin-1 (but this second case is not your case).
Have a look at https://stackoverflow.com/a/8547004/449288 to see how to solve this problem.
Problem 4) Which strings are you comparing? For example, if you have
$city = "München"
$_POST['city'] == $city
The result of this code will depend on the encoding of the PHP file. If the file is encoded in ISO Latin-1 and the $_POST correctly contains UTF-8 data, the == will compare different bytes and will return false.
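The mismatch is easy to see when both strings are written as explicit bytes. The sketch below assumes mbstring is available and assumes the "ANSI" source file is really Windows-1252; the variable names are illustrative only.

```php
<?php
// "München" spelled out byte by byte, so this file's own encoding does not matter.
$fromUtf8Form = "M\xC3\xBCnchen"; // what a UTF-8 form submission contains
$fromAnsiFile = "M\xFCnchen";     // the same word as a literal in a Windows-1252 source file

var_dump($fromUtf8Form == $fromAnsiFile); // bool(false): different bytes
echo bin2hex($fromUtf8Form), "\n";        // 4dc3bc6e6368656e
echo bin2hex($fromAnsiFile), "\n";        // 4dfc6e6368656e

// Converting the literal to UTF-8 first makes the comparison succeed.
$converted = mb_convert_encoding($fromAnsiFile, 'UTF-8', 'Windows-1252');
var_dump($fromUtf8Form === $converted);   // bool(true)
```

Running bin2hex() on both sides of a failing comparison is usually the quickest way to tell which encoding each string is really in.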
Another solution that may be helpful is in Apache: you can place a directive called AddDefaultCharset in your configuration file (httpd.conf) or in .htaccess. It looks like this:
AddDefaultCharset utf-8
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset
That will override any other default charsets.
I changed "mbstring.detect_order = pass" in my php.ini file and it worked.
I've used Unicode characters in my forms and files many times, and I have not had any problems so far.
Try to do these steps and check the result:
Remove header('Content-Type:text/html; charset=UTF-8'); from your HTML form codes.
Use your form just like <form action="whatever.php"> without accept-charset="UTF-8". (It's better to insert the method of sending data in your form tag).
In target page (whatever.php), insert again <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in a <head> tag.
I always did my project like what I mentioned here and I did not have any problem with Unicode strings.
This is due to the character encoding of the PHP file(s).
The hardcoded München is stored with the character encoding of the source file(s), in this case ANSI and when that value is compared to the UTF-8 encoded value provided in the $_POST variable, the two will, quite naturally, differ.
The solution to your problem is one of:
Serve and process content with the same encoding as that of the source file(s), in this case likely to be windows-1252.
This would, for starters, include changing the content="text/html; charset=UTF-8" to content="text/html; charset=windows-1252" whenever serving HTML data.
Avoid all hardcoded values that could be affected by character encoding issues between UTF-8 and windows-1252, more or less only hardcode values that only includes English letters and numbers.
Any UTF-8 values would have to be read from a source that ensures they are UTF-8 encoded (for instance a database set to use UTF-8 as storage encoding as well as connection encoding).
Wrap all hardcoded assignments in utf8_encode(), for instance $value = utf8_encode ('München');
Change the encoding of the source file(s) to UTF-8.
This can be accomplished in any number of ways, a decent text editor will be able to do it or the outstanding libiconv can be used, especially for batch processing.
Either solution 1 or 4 would be my preferred solution, especially if multiple people are involved in the project.
As a side note, some text editors (notably Notepad++) have the option of using either UTF-8 or UTF-8 without BOM. The BOM (Byte Order Mark) is pointless in UTF-8 and will cause problems when writing headers in PHP (most often when doing a redirect). This is because the BOM sits right in front of the initial <?php, causing the server to send the BOM just as it would had there been any other character in front. The difference is that you would notice a visible character in front, but the BOM isn't displayed.
Rule of thumb: Always use UTF-8 without BOM.
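If you suspect a file or string was saved with a BOM, you can check for the three bytes EF BB BF at the start. The helper below is a hypothetical sketch (stripUtf8Bom() is not a built-in function) that removes a leading UTF-8 BOM if one is present.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM (bytes EF BB BF) if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() compares only the first three bytes of $text against the BOM.
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

var_dump(stripUtf8Bom("\xEF\xBB\xBFhello") === 'hello'); // bool(true)
var_dump(stripUtf8Bom('hello') === 'hello');             // bool(true)
```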
I have a form on my site where users can submit text as part of a product review. The review goes to a MySQL database, where I can review it before approving it so it appears on my site. I received a review today that was filled with strange characters. For example, I think the below was supposed to come out as "fun" but instead it showed up in my MySQL DB as:
“funâ€Â
I'm pretty sure this is a character encoding issue, and I've read a few entries on stackoverflow about such issues, but I'm just not sure how to implement a fix. I'm guessing I need to change the php function I use to do data cleaning from the form, which is below:
function cleanDataForDB($data) {
$data = trim(htmlentities(strip_tags(nl2br($data),'<br><br />')));
if (get_magic_quotes_gpc())
$data = stripslashes($data);
$data = mysql_real_escape_string($data);
return $data;
}
The html for my site is encoded in UTF-8. I have this tag at the top of every page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Do I need to use a php encoding function, such as utf8_encode() on data entry and utf8_decode() when I'm displaying in a browser?
Any help is greatly appreciated. Thanks!
Chris
It's also good to make sure that the web server is advertising UTF-8, but that's not the culprit here. I use the Live HTTP Headers extension in Firefox to test. MySQL defaults to the latin1 character set, and you must explicitly set it otherwise with mysql_set_charset(). PHP itself is not very good at multi-byte character sets like UTF-8, but as long as it doesn't need to understand those characters (for example, in regular expression matching) you are safe. You just need to make sure all input and output to the user (via the meta tag) and to the database are aware of the character encoding.
Another utf-8 related problem I believe...
I am using PHP to update data in a MySQL db and then display that data elsewhere on the site. I have run into UTF-8 problems before, where special characters are displayed as question marks when viewed in a browser, but this one seems slightly different.
I have a number of records to enter that contain the è character. If I enter this directly in the db then it appears correctly on the page so I take this to mean that utf-8 content is being output correctly.
However, when I try to update the values in the db through PHP, the è character is replaced. What appears instead is & Atilde ; & uml ; (without the spaces), which appears in the browser as Ã¨
I have the tables in the database set to use UTF-8. I believe this is correct cos, as mentioned, if I update the db through phpMyAdmin, its all ok. Similarly I have set the character encoding for the page which seems to be correct. I am also running the sql statement "SET NAMES 'utf8';" before trying to update the db.
Anyone have any other ideas as to where the problem may lie?
Many thanks
Yup.
The character you have is LATIN SMALL LETTER E WITH GRAVE. As you can see, in UTF-8 that character is encoded into two bytes 0xC3 and 0xA8.
But in many default, western encodings (such as ISO-8859-1) which are single-byte only, this multi-byte character is decoded as two separate characters, LATIN CAPITAL LETTER A WITH TILDE and DIAERESIS. Notice how they are both encoded as C3 and A8 in ISO-8859-1?
Furthermore, it looks like PHP is processing these characters through htmlentities(), which results in &Atilde; and &uml; respectively.
So, where exactly is the problem in your code? Well, htmlentities() could be doing it all by itself since its 3rd argument is an encoding name, which you may not have properly set to 'UTF-8'. But it could be some other string processing function as well. (Note: As a general rule, it's a bad idea to store HTML entities in the database. This step should be reserved for time of display.)
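The effect of that 3rd argument can be shown directly. In this sketch è is written as its two UTF-8 bytes so the result does not depend on how the script file is saved; passing ENT_QUOTES as the flags argument is just an assumption for the example.

```php
<?php
$e = "\xC3\xA8"; // è as UTF-8 bytes

// Wrong charset: each byte is treated as a separate Latin-1 character.
echo htmlentities($e, ENT_QUOTES, 'ISO-8859-1'), "\n"; // &Atilde;&uml;

// Correct charset: the two bytes are recognised as one character.
echo htmlentities($e, ENT_QUOTES, 'UTF-8'), "\n";      // &egrave;

// htmlspecialchars() leaves the character alone and only escapes < > & and quotes.
echo bin2hex(htmlspecialchars($e, ENT_QUOTES, 'UTF-8')), "\n"; // c3a8
```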
There are a bunch of other ways to trip yourself up with UTF-8 in PHP. I suggest hitting up the cheatsheet and making sure you're in good shape.
Well, it is your own code that converts characters into entities.
To make it right:
Ban htmlentities function from your scripts forever.
Use htmlspecialchars, not on insert, but when displaying data.
Repair existing data in the database using html_entity_decode.
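As a rough sketch of the last two steps: the stored value below is a hypothetical example of è after it was run through htmlentities() with a Latin-1 charset, and decoding to ISO-8859-1 is an assumption that only fits data damaged in that particular way. Check a few rows by hand before running anything like this over a whole table.

```php
<?php
// Hypothetical stored value: è after htmlentities() treated its UTF-8 bytes as Latin-1.
$stored = '&Atilde;&uml;';

// Decoding to ISO-8859-1 turns each entity back into a single byte (C3, then A8),
// and those two bytes together are è in UTF-8.
$repaired = html_entity_decode($stored, ENT_QUOTES, 'ISO-8859-1');
echo bin2hex($repaired), "\n"; // c3a8

// At display time, escape only the HTML-significant characters.
$safe = htmlspecialchars('<b>' . $repaired . '</b>', ENT_QUOTES, 'UTF-8');
echo $safe, "\n"; // &lt;b&gt;è&lt;/b&gt;
```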
I suppose you're taking the results of some form submission and inserting the results in the database. If so, you must ensure that you instruct the browser to send UTF-8 data and you should validate the user input for a valid UTF-8 stream.
Change your form element to include accept-charset:
<form accept-charset="utf-8" method="post" ... >
<input type="text" name="field" />
...
</form>
Validate the data with:
$valid = array_key_exists("field", $_POST) && !is_array($_POST['field']) &&
preg_match('//u', $_POST['field']) && ...; //check length with mb_strlen etc.
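A fuller version of that check might look like the sketch below. It swaps the preg_match('//u', ...) trick for mb_check_encoding(), which assumes the mbstring extension is available; isValidTextField() and the 255-character limit are hypothetical choices for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: true only if $input[$key] is a string of valid UTF-8
// that is no longer than $maxLength characters (an assumed limit).
function isValidTextField(array $input, string $key, int $maxLength = 255): bool
{
    // Rejects missing keys and array values such as field[]=x.
    if (!array_key_exists($key, $input) || !is_string($input[$key])) {
        return false;
    }
    $value = $input[$key];

    return mb_check_encoding($value, 'UTF-8')
        && mb_strlen($value, 'UTF-8') <= $maxLength;
}

var_dump(isValidTextField(['field' => "M\xC3\xBCnchen"], 'field')); // bool(true)
var_dump(isValidTextField(['field' => "M\xFCnchen"], 'field'));     // bool(false): Latin-1 bytes
```

In a real handler you would pass $_POST as the first argument.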
I think you miss Content-Type declaration on the html page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you don't have it, the browser will guess the encoding, and convert any characters outside of that encoding to entities when posting a form.