Charset encoding problem - php

I am developing an Arabic web site and use AJAX to save some text in my database. The AJAX itself works fine. My problem is that when I save the data in my database and then try to print it on the screen, I get weird text back. I used the PHP function mb_detect_encoding to determine how the database deals with the text. The function returned UTF-8.
So I used iconv("windows-1256","UTF-8",$row["text"]) to print the text on the screen, but it is still returning this weird text. Please give me a hand.
Thanks

please take a look at this thread (and use the search before posting a question first).
in your case, i think you've forgotten to set the correct charset for your database connection (using a SET NAMES statement or mysql_set_charset()) - but that's hard to say.
this is a quote from chazomaticus, who has given a perfect answer in the linked thread, listing all the points you have to take care of:
Storage:
Specify utf8_unicode_ci (or equivalent) collation on all tables and text columns in your database. This makes MySQL physically store and retrieve values natively in UTF-8.
Retrieval:
In PHP, in whatever DB wrapper you use, you'll need to set the connection charset to utf8. This way, MySQL does no conversion from its native UTF-8 when it hands data off to PHP.
Note that if you don't use a DB wrapper, you'll probably have to issue a query to tell MySQL to give you results in UTF-8: SET NAMES 'utf8' (as soon as you connect).
Delivery:
You've got to tell PHP to deliver the proper headers to the client, so text will be interpreted as UTF-8. In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type header yourself, which is just more work but has the same effect.
Submission:
You want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
Note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
Although, on that front, you'll still want to verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously.
Processing:
This is, unfortunately, the hard part. You need to make sure that every time you process a UTF-8 string, you do so safely. Easiest way to do this is by making extensive use of PHP's mbstring extension.
PHP's string operations are NOT by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
Also, I feel like this should be said somewhere, even though it may seem obvious: every PHP or HTML file you'll be serving should be encoded in valid UTF-8.
note that you don't need to use utf-8 - the important part is to use the same charset everywhere, independent of what charset that might be. but if you need to change things anyway, use utf-8.
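As a rough sketch of the "verify every submitted string" advice in the quote above: the helper name is_valid_utf8 is made up for illustration, and the fallback to preg_match with the u modifier is an assumption for installations without mbstring, not something the quoted answer prescribes.

```php
<?php
// Hypothetical helper (not from the quoted answer): returns true when
// $s is well-formed UTF-8.
function is_valid_utf8(string $s): bool
{
    if (function_exists('mb_check_encoding')) {
        // The function the quoted answer recommends.
        return mb_check_encoding($s, 'UTF-8');
    }
    // Fallback assumption: with the "u" modifier, preg_match() returns
    // false when the subject is not valid UTF-8.
    return preg_match('//u', $s) === 1;
}

$good = "\xD9\x85\xD8\xB1"; // two Arabic letters as UTF-8 bytes
$bad  = "\xE9t\xE9";        // "été" in ISO-8859-1, not valid UTF-8

var_dump(is_valid_utf8($good)); // bool(true)
var_dump(is_valid_utf8($bad));  // bool(false)
```

A check like this would typically run on every request parameter before the value is stored or processed.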

I recommend changing your web pages to UTF-8.

Ideally, you should use the same encoding (UTF-8?) in your webpages, database, and JavaScript/AJAX. Many people forget to set charset for AJAX requests/responses, which gives you mangled data in some browsers (cough cough).

Thank you guys for your support, and sorry oezi for the confusion. I really did search and didn't find my answer. However, it works fine for me now. I am going to explain what I did to make it work, so anybody else can benefit from it:
- I set my tables' collation to utf8_unicode_ci.
- To submit the data, I used AJAX with the default charset UTF-8.
- When I get the data from my DB, I used the iconv function as follows:
iconv("UTF-8","windows-1256",$row["text"]) , and it works
I hope that is clear.
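One possible reading of why the direction of the conversion mattered: the data was already UTF-8, and the page was being displayed as windows-1256, so the text had to be converted from UTF-8 to windows-1256 rather than the other way round. The sketch below only illustrates that conversion on a single letter. The variable names are made up, and it assumes the iconv build in use knows the windows-1256 charset name.

```php
<?php
// U+0627 ARABIC LETTER ALEF, written out as its two UTF-8 bytes.
$utf8 = "\xD8\xA7";

// Convert in the same direction as the fix above: UTF-8 -> windows-1256.
$cp1256 = iconv('UTF-8', 'windows-1256', $utf8);

// Converting back should give the original bytes again.
$roundTrip = iconv('windows-1256', 'UTF-8', $cp1256);

printf("UTF-8 bytes: %d, windows-1256 bytes: %d\n", strlen($utf8), strlen($cp1256));
var_dump($roundTrip === $utf8);
```

Serving the page itself as UTF-8, as the other answers suggest, would make this conversion step unnecessary.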

Related

Help with multi-lingual text, php, and mysql

I have had no end of problems trying to do what I thought would be relatively simple:
I need to have a form which can accept user input text in a mix of English and other languages, some multi-byte (i.e. Japanese, Korean, etc.), and this gets processed by php and is stored (safely, avoiding SQL injection) in a mysql database. It also needs to be accessed from the database, processed, and used on-screen.
I have it set up fine for Latin chars, but when I add a mix of Latin and multi-byte chars it turns garbled.
I have tried to do my homework but just am banging my head against a wall now.
Magic quotes is off, I have tried using utf8_encode/decode, htmlentities, addslashes/stripslashes, and (in mysql) both "utf8_general_ci" and "utf8_unicode_ci" for the field in the table.
Part of the problem is that there are so many places where I could be messing it up that I'm not sure where to begin solving the problem.
Thanks very much for any and all help with this. Ideally, if someone has working php code examples and/or knows the right mysql table format, that would be fantastic.
Here is a laundry list of things to check are in UTF8 mode:
MySQL table encoding. You seem to have already done this.
MySQL connection encoding. Do SHOW VARIABLES LIKE 'char%' and you will see what MySQL is using. You need character_set_client, character_set_connection and character_set_results set to utf8, which can easily be set in your application by doing SET NAMES 'utf8' at the start of all connections. This is the one most people forget to check, IME.
If you use them, your CLI and terminal settings. In bash, this means LANG=(something).UTF-8.
Your source code (this is not usually a problem unless you have UTF8 constant text).
The page encoding. You seem to have this one right, too, but your browsers debug tools can help a lot.
Once you get all this right, all you will need in your app is mysql_real_escape_string().
Oh, and it is (sadly) possible to successfully store correctly encoded UTF8 text in a column with the wrong encoding type or from a connection with the wrong encoding type. And it can come back "correctly", too. Until you fix all the bits that aren't UTF8, at which point it breaks.
I don't think you have any practical alternatives to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the data base from the mysql command line, or perhaps through phpmyadmin. Track down and eliminate problems at that level. Then move out one more level by simulating input to your php and examining the output, again dealing with any problems. Finally add browsers into the mix.
First you need to check whether you can add multi-language text to your database directly. If that's possible, you can do it in your application.
Are you serializing any data by chance? PHP's serialize function has some issues when serializing non-English characters.
Everything you do should be utf-8 encoded.
One thing you could try is to json_encode() the data when putting it into the database and json_decode() it when it's retrieved.
The problem was caused by my not having the default char set in the php.ini file, and (possibly) not having set the char set in the mysql table (in PhpMyAdmin, via the Operations tab).
Setting the default char set to "utf-8" fixed it. Thanks for the help!!
Check your database connection settings. It also needs to support UTF-8.

What should I do to save all kind of user input characters in MySQL database?

I am creating a web base application using PHP and MySQL. I want it to be able to save any kind of user input characters, both English and non-English characters like Arabic or Japanese at the same time.
What should I do to achieve that?
You need to use Unicode. Read the MySQL manual section on Unicode and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You'll likely want to set the character set (encoding) of the table/columns in question to utf8. You'll also need to set the encoding of your HTML/PHP files to UTF-8. You can do this with a meta tag in <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
You can also set the Content-Type: header that Apache/PHP sends out.
Even after setting this, you may still run into browser-specific issues. For example, Internet Explorer may not always use UTF-8, so Rails 3 had to put in a workaround.
For MySQL, you first need to define your data with the UTF8 character set:
CREATE DATABASE xx [...] DEFAULT CHARACTER SET 'utf8' DEFAULT COLLATE utf8_general_ci
And when creating database connections from PHP, you just need to run a quick command after opening it:
SET NAMES 'utf8'
Alternatively, if you have access to MySQL's my.ini, you can just add this to the config and forget the above:
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8
(note that's not php.ini, but MySQL's ini)
For PHP, if you need to manipulate multibyte strings: make sure you have the mbstring library active, and then change your string & regexp function calls to use the mb_* equivalent.
Also, make sure your editor is saving in UTF8 so everything's consistent. Eclipse/PDT makes it easy, at least (project -> properties -> text file encoding).
Finally, handling cultural differences: that's the hard part. Sometimes it's as easy as setting p { direction: rtl; } in CSS, and other times you'll be tearing your hair out trying to decipher what alphabet(s) a user just posted with. It depends on what you're doing with the different languages.
For starters, make sure that you read up on SQL injection. You would need to take strong precautions so that you safely encode the input. Usually, you'd be filtering/discarding unsafe content. So if you really need to allow it, then you need to be careful that you don't make it easy to hack yourself.
Essentially, you need the same sort of protection, while allowing "dangerous" content such as source code examples, that sites like this one use. Also systems that are commonly targeted such as PHPBB2, WordPress, Wiki, etc..
I think your task is harder if the data needs to be searchable.
If you are using PHP, the mysql_real_escape_string() function looks good:
http://www.tizag.com/mysqlTutorial/mysql-php-sql-injection.php
Otherwise, use something similar.

Getting funny squares in browser when displaying content

I have content stored in a Postgres DB. Every time I fetch the content and display it using php, I get funny squares in IE and funny square-type question marks in Firefox.
Example below
* - March � May 2009
How do I remove this?
I do not have access to the server so can't adjust the encoding there, only have postgres DB details and FTP access to upload my files
I would also recommend: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, I've read it only recently myself, it will definitely help you sort out your problems.
You need to make sure that Postgres, PHP, and your browser all agree on the content encoding, and that you have an appropriate font selected in your browser. The simplest way to do that is to choose UTF8 for everything.
I don't know about PHP, but I do know about databases and browsers. First you need to find out if the database is UTF8. (From psql, I would do a "\l" and look at the encoding.) Then you need to find out if PHP supports UTF8 (I have no idea how you do that). Then you need to see how those characters are being stored in the database by the PHP app. Then you need to figure out if the web server is correctly reporting the content encoding. (On Linux/Unix, I'd use the program "HEAD" (not "head") to see the headers it's returning.) And then you need to figure out if your browser is using a font that supports the characters in question.
Or, you could just make sure you only store ASCII and forget the rest of the world exists. Not recommended.
Wrong charset somewhere. The characters could already be stored wrongly in the database, or you have the wrong charset in the meta tags on the page (try manually changing the charset in the browser), or there could be a problem with the wrong encoding being used when the page communicates with the database.
Check this page http://www.postgresql.org/docs/8.2/static/multibyte.html for more information.
Try to have the same encoding in all places, preferably UTF-8.
You have encoding issues. Make sure the encoding is set right in the database, in the html markup and make sure the files themselves are saved in proper encoding.

php mysql character set: storing html of international content

i'm completely confused by what i've read about character sets. I'm developing an interface to store french text formatted in html inside a mysql database.
What i understood was that the safe way to have all french special characters displayed properly would be to store them as utf8. so i've created a mysql database with utf8 specified for the database and each table.
I can see through phpmyadmin that the characters are stored exactly the way it is supposed to. But outputting these characters via php gives me erratic results: accented characters are replaced by meaningless characters. Why is that ?
do i have to utf8_encode or utf8_decode them? note: the html page character encoding is set to utf8.
more generally, what is the safe way to store this data? Should i combine htmlentities, addslashes, and utf8_encode when saving, and stripslashes,html_entity_decode and utf8_decode when i output?
MySQL performs character set conversions on the fly to something called the connection charset. You can specify this charset using the sql statement
SET NAMES utf8
or use a specific API function such as mysql_set_charset():
mysql_set_charset("utf8", $conn);
If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:
header('Content-type: text/html;charset=utf-8');
(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)
In most cases the connection charset and web charset are the only things that you need to keep track of, so if it still doesn't work there's probably something else you're doing wrong. Try experimenting with it a bit; it usually takes a while to fully understand.
I strongly recommend reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky, to understand what you are doing and why.
It is useful to consider the PHP-generated front end and the MySQL backend separate components. MySQL should not have to worry about display logic, nor should PHP assume that the backend does any sort of preprocessing on the data.
My advice would be to store the data in plain characters using utf8 encoding, and escape any dangerous characters with MySQLs methods.
PHP then reads the utf8 encoded data from database, processes them (with htmlentities(), most often), and displays it via whichever template you choose to use.
Emil H. correctly suggested using
SET NAMES utf8
which should be the first thing you call after making a MySQL connection. This makes the MySQL treat all input and output as utf8.
Note that if you have to use utf8_encode or utf8_decode functions, you are not setting the html character encoding correctly. It is easiest to require that every component of your system uses utf8, since that way you should never have to do manual encoding/decoding, which can cause hard to track issues later on.
In addition to what Emil H said, you also need this in your page head tag:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

How do I ensure that user entered data containing international characters doesn't get corrupted?

It often happens that characters such as é get transformed to Ã©, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF8.
I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.
I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.
How do I ensure that user entered data containing international characters doesn't get corrupted?
On the first look at http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet I think that one important thing is missing (perhaps I overlooked this one).
Depending on your MySQL installation and/or configuration you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a
SET NAMES utf8
query prior to any other query you send to the MySQL server.
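If you're using PDO on the PHP side you can set up the connection to automatically issue this query on every (re)connect by passing the init command as a driver option to the constructor (setting it with setAttribute() after the connection is open has no effect):
$db = new PDO($dsn, $user, $pass, array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));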
when initializing your db connection.
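Another way to get the same effect, sketched here under the assumption of PHP 5.3.6 or later with the pdo_mysql driver, is to put the charset into the DSN itself. The helper names build_mysql_dsn and connect_utf8, and the host, database and credentials in the usage comment, are placeholders invented for this example.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that carries the
// connection charset, so no separate SET NAMES query is needed.
function build_mysql_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical wrapper: opens the connection with exceptions enabled.
// Driver options go to the constructor, not to setAttribute().
function connect_utf8(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(build_mysql_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

echo build_mysql_dsn('localhost', 'mydb'), "\n";
// Usage (placeholder credentials):
// $db = connect_utf8('localhost', 'mydb', 'user', 'secret');
```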
Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, so should the collation. Picking the wrong collation won't garble your data though - Just make string-comparison/sorting work wrongly.
That said, there are several places, where you can set charset settings in PHP. I would recommend that you use utf-8 throughout, if possible. Places that needs charset specified are:
The database. This can be set on database, table and field level, and even on a per-query level.
Connection between PHP and database.
HTTP output; Make sure that the HTTP-header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
HTTP input. Generally forms will be submitted in the same charset as the page was served in, but to make sure, you should specify the accept-charset attribute. Also make sure that URLs are utf-8 encoded, or avoid using non-ascii characters in URLs (and GET parameters).
utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
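To make the latin1/utf-8 point concrete, here is a small sketch of what happens when a string that is already UTF-8 is converted from ISO-8859-1 to UTF-8 a second time. It uses iconv() as a stand-in for utf8_encode() and utf8_decode(), which is a substitution made for this example (those two functions are deprecated as of PHP 8.2). The variable names are illustrative.

```php
<?php
$e = "\xC3\xA9"; // "é" correctly encoded as UTF-8 (two bytes)

// Treat those two bytes as two Latin-1 characters and encode each one
// as UTF-8 again: this is the classic double-encoding mistake.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $e);

echo bin2hex($e), "\n";             // c3a9
echo bin2hex($doubleEncoded), "\n"; // c383c2a9

// The reverse conversion undoes exactly one layer of encoding.
$repaired = iconv('UTF-8', 'ISO-8859-1', $doubleEncoded);
var_dump($repaired === $e);         // bool(true)
```

The four bytes c3 83 c2 a9 are what a browser shows as two accented-looking characters in place of the single é, which is why an extra encode step on data that is already UTF-8 corrupts it.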
There are at least two gotchas with regard to utf-8 and PHP. The first is that PHP's built-in string functions expect strings to be single-byte. For a lot of operations this doesn't matter, but it means that you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually it's not a big problem, but especially when using third-party libraries you need to be aware that things could blow up on this. One option is to use the mbstring extension, which has the option to replace all troublesome functions with utf-8 aware alternatives (note that this function overloading was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions you call the mb_* functions directly). It's still not a 100% bulletproof solution, but it'll work for most cases.
Another problem is that some installations of PHP still have the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.
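The first gotcha can be seen in a few lines. This is only a sketch: utf8_length is a made-up helper name, and the preg_match_all fallback is an assumption for setups where mbstring is not loaded.

```php
<?php
// Hypothetical helper: counts characters (code points) instead of bytes.
function utf8_length(string $s): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    // Fallback assumption: count matches of "any character" in UTF-8 mode.
    return (int) preg_match_all('/./us', $s);
}

$word = "\xD8\xA7\xD9\x84"; // two Arabic letters, two bytes each

var_dump(strlen($word));      // int(4): bytes
var_dump(utf8_length($word)); // int(2): characters
```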
Things you should do:
Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header()-function to do it manually.
Make sure your database connection is UTF8. SET NAMES utf8 does the trick.
Make sure all your tables are set to UTF8.
Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.
You usually don't have to do too much with the mbstring or utf8_encode/decode functions when you do this.
For better unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following MySQL flags are set correctly -
default-character-set=utf8
skip-character-set-client-handshake //Important so the client doesn't enforce another encoding
Those can be set in the MySQL configuration file (under the [mysqld] section) or at run time by sending the appropriate queries.
Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:
Identify the encoding
somehow you want to find out what kind of encoding you're dealing with, otherwise, it's pretty pointless to consider it further. You'll end up with junk chars.
Handle your bytes
think of these strings less like 'strings' of characters, and more like lists of bytes
PHP is especially sneaky. Don't let it truncate your data on-the-fly. If you're regexing a UTF-8 string, make sure you identify it as such
Store for the LCD
Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphs? How about Arabic? Which of these is going to require the most space? Account for it.
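The points above about regexing and truncation can be sketched as follows. The strings and variable names are made up for the example. The only PHP feature relied on is the u pattern modifier, which tells PCRE to treat the pattern and subject as UTF-8.

```php
<?php
$s = "\xC3\xA9"; // "é" as UTF-8: one character, two bytes

// Without "u" the dot matches a single byte, so a one-character
// pattern does not match the two-byte string.
$byteMode = preg_match('/^.$/', $s);
// With "u" the dot matches one whole UTF-8 character.
$utf8Mode = preg_match('/^.$/u', $s);

var_dump($byteMode); // int(0)
var_dump($utf8Mode); // int(1)

// Byte-based truncation can cut a character in half.
$cut = substr("caf\xC3\xA9", 0, 4); // keeps only the first byte of "é"
// In UTF-8 mode preg_match() returns false for a malformed subject.
var_dump(preg_match('//u', $cut)); // bool(false)
```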
