Working on an application and it's a bit of a mess with its encoding protocol.
The application currently uses php_value default_charset ISO-8859-1 but also in places does <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The MySQL charset is Latin-1 (ISO-8859-1), which explains why default_charset is set that way.
There's also a wide range of encoding being done everywhere: utf8_encode, json_encode, mb_convert_encoding (less prevalent).
The biggest issue we are seeing is with our mobile app REST API. People submitting emojis and such can cause some really strange behavior. Fields being emptied on display, etc.
Is there a standard protocol for handling this type of encoding to get a more uniform approach?
I have been in that hell too (self created by the way).
The lazy textbook answer is: redo the whole software stack in UTF-8. But that isn't always feasible (finances are tight, time is an issue, etc.).
My practical advice would be:
Start with your database and MAKE it understand Unicode (UTF-8). The database is at the core of your application and should be able to store Unicode. This may sound scary, but if it already uses LATIN-1 you can easily convert the relevant columns to Unicode.
The PHP code should not notice the difference.
From there on, make sure your PHP is using UTF-8
php_value default_charset ISO-8859-1 <-- change that
And last, you'll have the messy job of gradually going through all the code and removing all conversions to LATIN-1.
I hope you can set up some kind of test environment, because hacking like this in a production environment is bad for your mental wellbeing. :-)
Good luck.
Erwin's answer is not complete.
MySQL's charset for UTF-8 is called utf8mb4, not simply utf8. You must use utf8mb4 to allow for Emoji.
"ISO-8859-1" is the same as MySQL's "latin1". But if you claim that it is UTF-8, a mess ensues. See Trouble with UTF-8 characters; what I see is not what I stored, especially the parts on "truncate", which seems to be your main symptom.
ALTER TABLE can be used in either of 2 ways to change the charset of individual or all columns. However, if the data has been truncated, it is lost and cannot be recovered without reloading. So, I suggest you change the table definitions and connection parameters and start over.
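The two ALTER TABLE forms are presumably the ones sketched here. The helpers only build SQL text so it can be reviewed before anything is run. The function names, the table `comments` and the column `body` are made up for illustration, and the column types have to match your real schema.

```php
<?php
// Hypothetical helpers: they only build SQL strings, they never touch a database.

// Way 1: the stored bytes really are latin1. MySQL re-encodes every text
// column of the table to utf8mb4.
function sqlConvertTable(string $table): string
{
    return "ALTER TABLE `$table` CONVERT TO CHARACTER SET utf8mb4 "
         . "COLLATE utf8mb4_unicode_ci";
}

// Way 2: the column is declared latin1 but actually holds UTF-8 bytes.
// Passing through a binary type changes the label without changing the bytes.
// MODIFY needs the full column definition, so restate NOT NULL / DEFAULT in $type.
function sqlRelabelColumn(string $table, string $column, string $type, string $binaryType): array
{
    return [
        "ALTER TABLE `$table` MODIFY `$column` $binaryType",
        "ALTER TABLE `$table` MODIFY `$column` $type CHARACTER SET utf8mb4 "
            . "COLLATE utf8mb4_unicode_ci",
    ];
}

// Connection parameter: request utf8mb4 through the PDO DSN.
function mysqlDsn(string $host, string $dbname): string
{
    return "mysql:host=$host;dbname=$dbname;charset=utf8mb4";
}
```

For example, sqlRelabelColumn('comments', 'body', 'TEXT', 'BLOB') gives the two statements to run in order. Picking the wrong way for a given column makes things worse, so try it on a copy of the table first.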
Do not use any mb conversion routines; that will make it even harder to debug.
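Separately from the truncation inside MySQL, here is one way "fields being emptied on display" can happen on the PHP side. This is only a sketch and the function name is made up: when bytes that are not valid UTF-8 reach an API that was told to expect UTF-8, PHP returns nothing rather than garbage.

```php
<?php
// Hypothetical helper: render a value as HTML and as JSON, both as UTF-8.
function renderAndEncode(string $name): array
{
    return [
        // Flags are passed explicitly: without ENT_SUBSTITUTE, htmlspecialchars()
        // returns an empty string when the input is not valid UTF-8.
        'html' => htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
        // json_encode() returns false when it meets invalid UTF-8.
        'json' => json_encode(['name' => $name], JSON_UNESCAPED_UNICODE),
    ];
}

// "caf\xE9" is "café" in ISO-8859-1; the lone 0xE9 byte is not valid UTF-8.
// renderAndEncode("caf\xE9")    gives an empty 'html' and a false 'json'.
// renderAndEncode("caf\u{E9}")  (real UTF-8) gives the expected output.
```

If your API returns empty or missing fields only for certain rows, checking those rows for invalid UTF-8 is a cheap first test.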
Related
I am developing an Arabic web site. However, I use AJAX to save some text in my data base. The AJAX works fine with me. My problem is, when I save the data in my database and try to print it on my screen, it returns a weird text. I have used the PHP function mb_detect_encoding to determine how the database deals with the text. The function returned UTF-8.
So I used iconv("windows-1256","UTF-8",$row["text"]) to print the text on the screen, but it is still returning this weird text. Please give me a hand.
Thanks
please take a look at this thread (and use the search before posting a question first).
in your case, i think you've forgotten to set the correct charset for your database-connection (using a SET NAMES statement or mysql_set_charset()) - but that's hard to say.
this is a quote from chazomaticus, who has given a perfect answer in the linked thread, listing all the points you have to take care of:
Storage:
Specify utf8_unicode_ci (or equivalent) collation on all tables and text columns in your database. This makes MySQL physically store and retrieve values natively in UTF-8.
Retrieval:
In PHP, in whatever DB wrapper you use, you'll need to set the connection charset to utf8. This way, MySQL does no conversion from its native UTF-8 when it hands data off to PHP.
Note that if you don't use a DB wrapper, you'll probably have to issue a query to tell MySQL to give you results in UTF-8: SET NAMES 'utf8' (as soon as you connect).
Delivery:
You've got to tell PHP to deliver the proper headers to the client, so text will be interpreted as UTF-8. In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type header yourself, which is just more work but has the same effect.
Submission:
You want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
Note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
Although, on that front, you'll still want to verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously.
Processing:
This is, unfortunately, the hard part. You need to make sure that every time you process a UTF-8 string, you do so safely. Easiest way to do this is by making extensive use of PHP's mbstring extension.
PHP's string operations are NOT by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
Also, I feel like this should be said somewhere, even though it may seem obvious: every PHP or HTML file you'll be serving should be encoded in valid UTF-8.
note that you don't need to use utf-8 - the important part is to use the same charset everywhere, independent of what charset that might be. but if you need to change things anyway, use utf-8.
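The "verify every submitted string" step from the quote could look roughly like this. It is a minimal sketch: acceptUtf8() is a made-up name, and what to do with a rejected value (error response, logging) is up to you.

```php
<?php
// Hypothetical helper: accept a submitted value only if it is a string that
// validates as UTF-8. Returns the string unchanged, or null so the caller
// can reject the request instead of storing broken bytes.
function acceptUtf8($value): ?string
{
    if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    return $value;
}

// Typical use at the top of a form or AJAX handler:
// $text = acceptUtf8($_POST['text'] ?? null);
// if ($text === null) { http_response_code(400); exit; }
```

Rejecting is usually safer than guessing the encoding and converting, because a wrong guess stores garbage that looks valid.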
I recommend changing your web pages to UTF-8.
Ideally, you should use the same encoding (UTF-8?) in your webpages, database, and JavaScript/AJAX. Many people forget to set charset for AJAX requests/responses, which gives you mangled data in some browsers (cough cough).
Thank you guys for your support, and sorry oezi for that confusion. I really made a search and didn't find my answer. However, it works fine with me now. I am going to explain what I did to make it work, so anybody else can get benefit of it:
- I set my tables' collation to utf8_unicode_ci.
- To submit the data, I used AJAX with the default charset UTF-8.
- When I get the data from my DB, I used the iconv function as follows:
iconv("UTF-8","windows-1256",$row["text"]) , and it works
I hope that's clear.
I am creating a web-based application using PHP and MySQL. I want it to be able to save any kind of user input characters, both English and non-English characters like Arabic or Japanese at the same time.
What should I do to achieve that?
You need to use Unicode. Read the MySQL manual section on Unicode and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You'll likely want to set the character set (encoding) of the table/columns in question to utf8. You'll also need to set the encoding of your HTML/PHP files to UTF-8. You can do this with a meta tag in <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
You can also set the Content-Type: header that Apache/PHP sends out.
Even after setting this, you may still run into browser-specific issues. For example, Internet Explorer may not always use UTF-8, so Rails 3 had to put in a workaround.
For MySQL, you first need to define your data with the UTF8 character set:
CREATE DATABASE xx [...] DEFAULT CHARACTER SET 'utf8' DEFAULT COLLATE utf8_general_ci
And when creating database connections from PHP, you just need to run a quick command after opening it:
SET NAMES 'utf8'
Alternatively, if you have access to MySQL's my.ini, you can just add this to the config and forget the above:
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8
(note that's not php.ini, but MySQL's ini)
For PHP, if you need to manipulate multibyte strings: make sure you have the mbstring library active, and then change your string & regexp function calls to use the mb_* equivalent.
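To see why the mb_* equivalents matter, here is a small comparison. It assumes the mbstring extension is loaded and that the input is UTF-8. The function name is just for illustration.

```php
<?php
// Hypothetical helper: run byte-oriented and mbstring functions side by side.
function compareStringFunctions(string $s): array
{
    return [
        'strlen'        => strlen($s),                   // counts bytes
        'mb_strlen'     => mb_strlen($s, 'UTF-8'),       // counts characters
        'substr'        => substr($s, 0, 3),             // may cut inside a character
        'mb_substr'     => mb_substr($s, 0, 3, 'UTF-8'), // cuts between characters
        'mb_strtoupper' => mb_strtoupper($s, 'UTF-8'),   // also handles non-ASCII letters
    ];
}

// With "na\u{EF}ve" ("naïve"): strlen is 6, mb_strlen is 5, and the substr()
// result ends in half of the "ï", which is no longer valid UTF-8.
```

A string cut in the middle of a character is exactly the kind of value that later gets rejected or displayed as garbage.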
Also, make sure your editor is saving in UTF8 so everything's consistent. Eclipse/PDT makes it easy, at least (project -> properties -> text file encoding).
Finally, handling cultural differences: that's the hard part. Sometimes it's as easy as setting p { direction: rtl; } in CSS, and other times you'll be tearing your hair out trying to decipher what alphabet(s) a user just posted with. It depends on what you're doing with the different languages.
For starters, make sure that you read up on SQL injection. You would need to take strong precautions so that you safely encode the input. Usually, you'd be filtering/discarding unsafe content. So if you really need to allow it, then you need to be careful that you don't make it easy to hack yourself.
Essentially, you need the same sort of protection, while allowing "dangerous" content such as source code examples, that sites like this one use. Also systems that are commonly targeted such as PHPBB2, WordPress, Wiki, etc..
I think your task is harder if the data needs to be searchable.
If you are using PHP, the mysql_real_escape_string() function looks good:
http://www.tizag.com/mysqlTutorial/mysql-php-sql-injection.php
Otherwise, use something similar.
About 2 years ago I made the mistake of starting a large website using iso-8859-1. I now am having issues with some characters, especially when sending data to the server using ajax. Because of this, I would like to switch to using UTF-8.
What issues do you see coming from this? I know I would have to search the site to look for characters that need to be changed from a ? to their real characters. But, are there any other risks in doing this? Has anyone done this before?
The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:
Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.
Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively — UTF-16 is next most common — and check that it's configured to use UTF-8 on its output to the web server.
Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?
There are at least three different places you can declare the charset for a web document. Be sure you change them all:
the HTTP Content-Type header
the <meta http-equiv="Content-Type"> tag in your documents' <head>
the <?xml> tag at the top of the document, if using XHTML Strict
All this comes from my experiences years ago when I traced some Unicode data through a moderately complex N-tier app, and found conversion chains like:
Latin-1 → UTF-8 → Latin-1 → UTF-8
So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset common with Latin-1.
The biggest reason for those odd conversion chains was due to immature Unicode support in the tooling at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
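If you would rather drive the conversion from PHP than from the shell, the same idea can be sketched with PHP's iconv() function. convertFileToUtf8() is a made-up name, and the "already UTF-8" check is a heuristic: a Latin-1 file whose bytes happen to form valid UTF-8 would be skipped, so keep backups.

```php
<?php
// Hypothetical helper: convert one file from ISO-8859-1 to UTF-8 in place.
// Files that already validate as UTF-8 (including pure ASCII) are skipped,
// so running the script twice does not double-encode anything.
// Returns true if the file was rewritten.
function convertFileToUtf8(string $path): bool
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return false;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert $path");
    }
    return file_put_contents($path, $converted) !== false;
}
```

Loop it over every template, include and static text file, then review the diff before committing.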
Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.
Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).
IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:
PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
The most common problem is treating a UTF-8 string as ISO-8859-1 and vice versa - you may need to add documentation to your functions stating what kind of encoding they expect, to make these sorts of errors less likely. A variable naming convention for your strings (to indicate what encoding they use) may also help.
I'm still learning the ropes with PHP & MySQL and I know I'm doing something wrong here with how character sets are set up, but can't quite figure out from reading here and on the web what I should do.
I have a standard LAMP installation with PHP 5, MySQL 5. I set everything up with the defaults. When some of my users input comments to our database some characters show up incorrectly - mostly apostrophes and em dashes at the moment. In MySQL apostrophes show up as ’. They display on the page this way also (I'm using htmlentities to output user comments).
In phpMyAdmin it says my MySQL Charset is UTF8-Unicode.
In my database my tables are all set up with the default Latin1-Swedish-ci.
My web pages all have meta http-equiv="Content-Type" content="text/html; charset=utf-8"
When I look at the site's http headers I see: Content-Type: text/html
Like a newbie, I hadn't considered character sets at all until things started looking odd on some of my pages. So does it make most sense for me to convert everything to utf-8 and will this affect my PHP code? Or should I try to get it all into Latin? And do I have to go into the database and replace these odd codes, or will they magically display once I set up the charsets properly? All the fiddling I've done so far hasn't helped (I set the http headers to utf-8, and also tried latin).
If you really want to understand these issues, I would start by reading this article at mysql.com. Basically, you want every piece of the puzzle to expect UTF-8 unicode. On the PHP side, you want to do something like:
<?php header("Content-type: text/html; charset=utf-8");?>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
And when you run your insert queries you want to make sure both the table's character encoding and the encoding that you're running the queries in are UTF-8. You can accomplish the latter by running the query SET NAMES utf8 right before you run an insert query.
http://www.phpwact.org/php/i18n/charsets
That site gave me a lot of good advice on how to make everything play nice in UTF-8.
I also recommend switching from htmlentities to htmlspecialchars, as it is more UTF-8 friendly.
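A rough illustration of the difference, with made-up function names. Before PHP 5.4 both functions assumed ISO-8859-1 unless told otherwise, which is how htmlentities ends up turning each byte of a UTF-8 character into its own entity.

```php
<?php
// Hypothetical wrapper so the charset is stated in exactly one place.
// ENT_SUBSTITUTE replaces invalid sequences instead of returning ''.
function escapeHtml(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// What goes wrong when htmlentities() is told the input is ISO-8859-1 but
// actually receives UTF-8: "é" (bytes C3 A9) becomes two unrelated entities.
function mangledEntities(string $s): string
{
    return htmlentities($s, ENT_QUOTES, 'ISO-8859-1');
}
```

escapeHtml() leaves non-ASCII characters alone and only escapes the HTML special characters, so a UTF-8 page shows them as typed.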
The main point is to make sure everything is talking the same language: your database, your database connection, your PHP and your page should all be in utf8 (the page should have a meta tag and a header saying so).
Sorry for not understanding all of your question. But when part of the question is "UTF-8 or not?", the answer is: "UTF-8, of course!"
You definitely want to sort things out now rather than later. One of the most important programming rules is not to keep going with a bad idea - don't dig yourself in any deeper!
Because every latin1 character also exists in utf-8, you can convert your tables to utf-8 without manipulating the data by hand. MySQL will sort this part out for you.
It's then important to check that everything is speaking utf-8. Set the http headers in apache or use a meta tag - this says to a browser that the HTML output is utf-8.
With this in mind, you need to make sure all of the data you send really is utf-8! Configure your IDE to save php/html files as utf-8. Finally make sure that PHP is using a utf-8 connection to MySQL - issue this query after connecting:
SET NAMES 'utf8';
It often happens that characters such as é get transformed to Ã©, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF8.
I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.
I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.
How do I ensure that user entered data containing international characters doesn't get corrupted?
On the first look at http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet I think that one important thing is missing (perhaps I overlooked this one).
Depending on your MySQL installation and/or configuration you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a
SET NAMES utf8
query prior to any other query you send to the MySQL server.
If you're using PDO on the PHP side you can set up the connection to automatically issue this query on every (re)connect by using
// The init command must be passed to the constructor; setAttribute() after connecting is too late.
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8");
$db = new PDO($dsn, $user, $pass, $options);
when initializing your db connection.
Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, so should the collation. Picking the wrong collation won't garble your data though; it just makes string comparison and sorting work wrongly.
That said, there are several places where you can set charset settings in PHP. I would recommend that you use utf-8 throughout, if possible. Places that need a charset specified are:
The database. This can be set on database, table and field level, and even on a per-query level.
Connection between PHP and database.
HTTP output; Make sure that the HTTP-header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
HTTP input. Generally forms will be submitted in the same charset as the page was served in, but to make sure, you should specify the accept-charset attribute. Also make sure that URLs are utf-8 encoded, or avoid using non-ASCII characters in URLs (and GET parameters).
utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
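The classic é to Ã© corruption is what you get when that latin1-to-utf-8 conversion is applied to a string that is already utf-8. A small sketch, using mb_convert_encoding() with explicit encodings because utf8_encode() itself is deprecated as of PHP 8.2. The function names are made up.

```php
<?php
// Same conversion utf8_encode() performs: treat every byte as one
// ISO-8859-1 character and write it out as UTF-8.
function latin1ToUtf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// The reverse direction (what utf8_decode() does). Applied to a
// double-encoded string, it undoes exactly one layer of encoding.
function undoOneEncoding(string $s): string
{
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

// latin1ToUtf8("\xE9")    is "é" in UTF-8 (bytes C3 A9): correct.
// latin1ToUtf8("\u{E9}")  is "Ã©" (bytes C3 83 C2 A9): double-encoded.
```

So if you see Ã© in the database, some layer converted data that was already utf-8. Find and remove that layer rather than adding a decode on the way out.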
There are at least two gotchas in regards to utf-8 and PHP. The first is that PHP's builtin string functions expect strings to be single-byte. For a lot of operations, this doesn't matter, but it means that you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually, it's not a big problem, but especially when using third-party libraries, you need to be aware that things could blow up on this. One option is to use the mbstring extension, which has the option to replace all troublesome functions with utf-8 aware alternatives (this function overloading was removed in PHP 8.0). It's still not a 100% bulletproof solution, but it'll work for most cases.
Another problem is that some installations of PHP still have the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.
Things you should do:
Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header()-function to do it manually.
Make sure your database connection is UTF8. SET NAMES utf8 does the trick.
Make sure all your tables are set to UTF8.
Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.
You usually don't have to do too much with the mbstring or utf8_encode/decode functions when you do this.
For better unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following MySQL flags are set correctly -
default-character-set=utf8
# Important so the client doesn't enforce another encoding
skip-character-set-client-handshake
Those can be set in the MySQL configuration file (under the [mysqld] section) or at run time by sending the appropriate queries.
Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:
Identify the encoding
Somehow you want to find out what kind of encoding you're dealing with; otherwise, it's pretty pointless to consider it further. You'll end up with junk chars.
Handle your bytes
Think of these strings less like 'strings' of characters, and more like lists of bytes.
PHP is especially sneaky. Don't let it truncate your data on-the-fly. If you're regexing a UTF-8 string, make sure you identify it as such.
Store for the LCD (lowest common denominator)
Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphs? How about Arabic? Which of these is going to require the most space? Account for it.
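In PHP terms, the last two points can be sketched like this, assuming the data is UTF-8. The function names are only for illustration.

```php
<?php
// Bytes needed to store one character in UTF-8. strlen() counts bytes,
// which is exactly what matters when sizing storage.
function utf8Bytes(string $char): int
{
    return strlen($char);
}

// Return the first match of $pattern in $subject, or '' if there is none.
// Without the /u modifier PCRE works on single bytes, so "." matches one
// byte; with /u it matches one whole UTF-8 character.
function firstMatch(string $pattern, string $subject): string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : '';
}

// utf8Bytes("a") is 1, an Arabic letter such as "\u{628}" is 2,
// a Chinese character such as "\u{4E2D}" is 3, an emoji such as "\u{1F600}" is 4.
// firstMatch('/^./u', "\u{4E2D}\u{6587}") returns the whole first character;
// firstMatch('/^./',  "\u{4E2D}\u{6587}") returns only its first byte.
```

So the worst case to budget for is 4 bytes per character, which is also why MySQL needs utf8mb4 rather than its 3-byte utf8 to hold emoji.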