I've made a test program that is basically just a textarea that I can enter characters into and when I click submit the characters are written to a MySQL test table (using PHP).
The test table is collation is UTF-8.
The script works fine if I want to write a é or ú to the database it writes fine. But then if I add the following meta statement to the <head> area of my page:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...the characters start becoming scrambled.
My theory is that the server is imposing some encoding that works well, but when I add the UTF-8 directive it overrides this server encoding and that this UTF-* encoding doesn't include the characters such as é and ú.
But I thought that UTF-8 encoded all (bar Klingon etc) characters.
Basically my program works but I want to know why when I add the directive it doesn't.
I think I'm missing something.
Any help/teaching most appreciated.
Thanks in advance.
Firstly, PHP generally doesn't handle the Unicode character set or UTF-8 character encoding. With the exception of (careful use of) mb_... functions, it just treats strings as binary data.
Secondly, you need to tell the MySQL client library what character set / encoding you're working with. The 'SET NAMES' SQL command does the job, and different MySQL clients (mysql, mysqli etc..) provide access to it in different ways, e.g. http://www.php.net/manual/en/mysqli.set-charset.php
Your browser, and MySQL client, are probably both defaulting to latin1, and coincidentally matching. MySQL then knows to convert the latin1 binary data into UTF-8. When you set the browser charset/encoding to UTF-8, the MySQL client is interpreting that UTF-8 data as latin1, and incorrectly transcoding it.
So the solution is to set the MySQL client to a charset matching the input to PHP from the browser.
Note also that table collation isn't the same as table character set - collation refers to how strings are compared and sorted. Confusing stuff, hope this helps!
Related
I have the following problem: on a very simple php-mysqli query:
if ( $result = $mysqli->query( $sqlquery ) )
{
$res = $result->fetch_all();
$result->close();
}
I get strings wrongly encoded as Western encoded string, although the database, the table and the column is in utf8_general_ci collation. The php script itself is utf-8 encoded and the mysql-less parts of the script get the correct encodings. So say echo "ő" works perfectly, but echo $res[0] from the previous example outputs the EF BF BD character when the file viewed in the correct UTF-8 encoding. If I manually switch the browser's encoding to Western, the mysqli sourced strings get good decoding, except for the non-western characters being replaced with "?'.
What makes it even stranger is that on my development environment this isn't happening, while on my webserver it is. The developer environment is a LAMP stack (The Uniform Server), while the webserver uses nginx.
In this case, I entered the data in the database using phpMyAdmin, and inside phpmyadmin it displays perfectly. phpMyAdmin's collation is utf-8 too. I believe that the problem must be somewhere around here, as on the same webserver, for an other site where I enter data through php (using POST) the same problem doesn't happen. On that case, the data is visible correctly both while entering and while viewing it (I mean in the php generated webpages), but the special characters are not correct in phpMyAdmin.
Can you help me start where to debug? Is it connected to php or mysql or nginx or phpMyAdmin?
Use mysqli_set_charset to change the client encoding to UTF-8 just after you connect:
$mysqli->set_charset("utf8");
The client encoding is what MySql expects your input to be in (e.g. when you insert user-supplied text to a search query) and what it gives you the results in (so it has to match your output encoding in order for echo to display things correctly).
You need to have it match the encoding of your web page to account for the two scenarios above and the encoding of the PHP source file (so that the hardcoded parts of your queries are interpreted correctly).
Update: How to convert data inserted using latin-1 to utf-8
Regarding data that have already been inserted using the wrong connection encoding there is a convenient solution to fix the problem. For each column that contains this kind of data you need to do:
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET latin1;
ALTER TABLE table_name MODIFY column_name BLOB;
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET utf8;
The placeholders table_name, column_name and existing_column_type should be replaced with the correct values from your database each time.
What this does is
Tell MySql that it needs to store data in that column in latin1. This character set contains only a small subset of utf8 so in general this conversion involves data loss, but in this specific scenario the data was already interpreted as latin1 on input so there will be no side effects. However, MySql will internally convert the byte representation of your data to match what was originally sent from PHP.
Convert the column to a binary type (BLOB) that has no associated encoding information. At this point the column will contain raw bytes that are a proper utf8 character string.
Convert the column to its previous character type, telling MySql that the raw bytes should be considered to be in utf8 encoding.
WARNING: You can only use this indiscriminate approach if the column in question contains only incorrectly inserted data. Any data that has been correctly inserted will be truncated at the first occurrence of any non-ASCII character!
Therefore it's a good idea to do it right now, before the PHP side fix goes into effect.
Use mysqli::set_charset function.
$mysqli->set_charset('utf8'); //returns false if the encoding was not valid... won't happen
http://php.net/manual/en/mysqli.set-charset.php
I haven't used mysqli for some time, but if things are the same, connections by default use the latin swedish encoding (ISO 8859 1).
I will consider your page is already using utf8 encoding by having:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Inside the <head> tag.
If you have string already on latin swedish encoding, you can use mk_convert_encoding:
http://php.net/manual/en/function.mb-convert-encoding.php
$fixedStr = mb_convert_encoding($wrongStr, 'UTF-8', 'ISO-8859-1');
iconv does something very similar: Truth be told, I don't know the difference, but here's the link to the function reference:
http://php.net/manual/en/function.iconv.php
I just realized that you might have some strings in utf8 and others in latin swedish. You can use mb_detect_encoding for that: http://php.net/manual/en/function.mb-detect-encoding.php
You can also dump the database and use iconv (cmd line) if you have it installed:
iconv -f latain -t utf-8 < currentdb.sql > fixeddb.sql
In districts table
I have a row as
district_id district_name country_id
15 Šahty 16
While selecting from php and displaying in browser,it shows like this :�ahty
I am using mssql 2005 with collation SQL_Latin1_General_CP1_CI_AS.
The problem is something like this
removing accent and special characters
but i need the solution in php.
UPDATE(?):
There is no support for UTF-8 in sqlserver.
https://dba.stackexchange.com/questions/7346/mssql-2005-2008-utf-8-collation-charset
Hi you need to consider correct HTML content type header
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Data may be selected correctly, but browser may be can not displayed them as you expected.
You can play with this in firefox by Menu View -> Character encoding -> until you find correct one
In order to make special characters work, a general rule is that all the components must be on the same encoding. This means that database, database connection (very often forgotten 'SET NAMES {charset}' call after connecting to database) and web page Content-type have to be all in the same character set.
If you ask data from latin1 database and have database connection also has latin1, make sure the page you display values at is also latin1.
It's recommended though to use UTF-8 instead of latin1 everywhere, so if possible I recommending changing charsets and data in your database all to UTF-8, as it's more compatible all-around and easier to handle.
Just remove the UTF8 charset and let the browser select the charset it will set to ISO-8859-1 that will work with accents in sql server
So, I have built on this system for quite some time, and it is currently outputting Latin1 (ISO-8859-1) to the web browser, and this is the components:
MySQL - all data is stored with the Latin1 character set
PHP - All PHP text files are stored on disk with Latin1 encoding
HTML - The output has the http-equiv="content-type" content="text/html; charset=iso-8859-1" meta tag
So, I'm trying to understand how the encoding of the different parts come into play in my workflow. If I open a PHP script and change its encoding within the text editor to UTF-8 and save it back to disk and reload the web browser, the text is all messed up - unless the text comes from the DB. If I change the encoding of the DB to UTF-8 and keep the PHP files in latin1 I have to use utf8_decode() for the data to display correctly. And if I change the HTML code the browser will read it incorrectly.
So yeah, I realise that if I want to "upgrade" to UTF8, I have to update all three parts of this setup for it to work correctly, but since it's a huge system with some 180k lines of PHP code and millions of posts in a lot of databases/tables, I don't want to start something like this without understanding everything correctly.
What haven't I thought about? What could mess this up beyond fixing? What are the procedures for changing the encoding of an entire MySQL installation and what's the easiest way to change the encoding of hundreds or thousands of PHP files on disk?
The META tag is luckily added dynamically, so I'll change that in one place only :)
Let me hear about your experiences with this.
It's tricky.
You have to:
change the DB and every table character set/encoding – I don't know much about MySQL, but see here
set the client encoding to UTF-8 in PHP (SET NAMES UTF8) before the first query
change the meta tag and possible the Content-type header (note the Content-type header has precedence)
convert all the PHP files to UTF-8 w/out BOM – you can easily do that with a loop and iconv.
the trickiest of all: you have to change most of your string function calls. Than means mb_strlen instead of strlen, mb_substr instead of substr and $str[index], etc.
Don't convert to UTF8 if you don't have to. Its not worth the trouble.
UTF8 is (becoming) the new standard, so for new projects I can recommend it.
Functions
Certain function calls don't work anymore. For latin1 it's:
echo htmlentities($string);
For UTF8 it's:
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
strlen(), substr(), etc. Aren't aware of the multibyte characters.
MySQL
mysql_set_charset('UTF8') or mysql_query('SET NAMES UTF8') will convert all text to UTF8 coming from the database(SELECTs). It will also convert incoming strings(INSERT, UPDATE) from UTF8 to the encoding of the table.
So for reading from a latin1 table it's not necessary to convert the table encoding.
But certain characters are only available in unicode (like the snowman ☃, iPhone emoticons, etc) and can't be converted to latin1. (The data will be truncated)
Scripts
I try to prevent specials-characters in my php-scripts / templates.
I use the ë notation instead of ë etc. This way it doesn't matter if is saved in latin1 or utf8.
I have a HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù
I have a PHP script that exports these form submissions into CSV format, when I look at the CSV format in a text editor (vim or notepad for example) the characters look fine, but when opened with Open Office or Word, I get some funky results: �����
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
What can I do to ensure portability of my CSV file? What's the proper way to handle the encoding?
My HTML file is content-type is set as: Content-Type: text/html; charset=utf-8
Data is being stored in MySQL as latin1_swedish_ci collation.
Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
If this does not match the table encoding, MySQL automatically converts data on the fly.
If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.
Also, what document type have you set, is it?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Try using the htmlentities() function for any text that is not showing correctly.
You may also want to have a look PHP Normalizer.
Make sure you are writing the CSV file as UTF-8. See http://www.php.net/manual/en/function.fwrite.php#55054 if you are unsure how to.
(Also, your sql table should be using utf8, not latin1)
It's up to you to decide which charset encoding you'll use for writing your CSV file (but, note, that must be a concious decision on your part).
Which charset encoding to use ? CSV does not defines a charset encoding - So I'd go for some Unicode charset, presumably UTF8. But some CSV consumers (eg Excel) might not be happy with it. If you are restricted to "western" langs, then latin1 or its variants (iso-8859-1 or iso-8859-15) might be more appropiate. But then (in any case, actually) you must think the conversion from user input to your particular encoding - and what to do if there are invalid characters.
(BTW: same consideration goes for the html-input-to-db conversion - you are using latin1 for your database, have you asked yourself what happens if the user types a non-latin1 character ? eg a japanese char ? ).
Why is this the extended ascii character (â, é, etc) getting replaced with <?> characters?
I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the Font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving my those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I Changed everything in my PHP and HTML to:
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?
That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta /> tag immediately after the <head /> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Basic Latin), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different than the encoding the browser reports as, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes â in my HTML and é becomes é. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.
As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.
Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values
There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8, AFAIK, it will insert the same data but now believe it's UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.
These special characters generally appear due to the the extensions. If we provide a meta tag with charset=utf-8 we can eliminate them by adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your meta tags