Character encoding has always been a problem for me. I have always tried to solve it with little patches, but that never really fixes anything, so I am looking for a robust solution to all of these problems. I want to learn how big apps (Facebook, Google, other multilingual AJAX apps and APIs) handle this. I use PHP, MySQL, HTML and JavaScript to build my application, so the solution should cover all of these languages together. A full configuration would be perfect, but if there is a long document, I can read that too. I need help, because I cannot transfer strings (text) correctly through all these languages. Thank you.
Also, I pull data from external APIs. How should I take care of them?
It's pretty easy if you just stick to using Unicode everywhere.
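set MySQL table encodings to UTF-8 (on MySQL 5.5.3 and later prefer utf8mb4, since the older utf8 charset stores at most three bytes per character and cannot hold characters outside the Basic Multilingual Plane)
make sure you're talking to the database in UTF-8 by running SET NAMES utf8 (or SET NAMES utf8mb4 to match)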
save all your source code in UTF-8
when manipulating strings in PHP which may contain UTF-8 characters, use the mb_ functions
send HTTP Content-Type headers denoting that the content is in UTF-8
JavaScript strings are Unicode internally (UTF-16), so you should have no worries there as long as your pages and AJAX responses are served as UTF-8
The thing is that different technologies default to different character encodings. Unfortunately, strings do not have implicit encoding metadata attached; they're just sequences of bytes. Unless it is told, the receiver of a string can only make a best guess at what encoding that sequence is supposed to be in. Whenever you connect two pieces of anything, you need to make sure they're using the same encoding (or you need to explicitly convert from one encoding to the other). Always assume that you have to define the encoding somewhere; how exactly that needs to be done depends on the technology.
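To make the "just sequences of bytes" point concrete, here is a minimal PHP sketch (assuming the mbstring extension is loaded). It is only an illustration: it takes the two UTF-8 bytes of "é", shows what a receiver produces when it wrongly assumes ISO-8859-1, and then reverses that one wrong step.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// A receiver that wrongly assumes ISO-8859-1 treats each byte as one
// character ("Ã" and "©") and re-encodes both of them as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Undoing exactly that one wrong conversion recovers the original bytes.
$repaired = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";      // c3a9
echo bin2hex($misread), "\n";   // c383c2a9
echo bin2hex($repaired), "\n";  // c3a9
```

Nothing in the bytes themselves says which reading is right, which is why every hop has to be told the encoding.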
I'm working on an application using the CakePHP framework, and in the past I have run into a few encoding problems.
To avoid these issues in my application, I started doing some research. But I'm still a little confused about the how and why.
My application will need to support all languages, yes, even languages such as Chinese. Most of the data will be stored in a MySQL database, and that's where the confusion starts. What should I use as collation?
Based on what I've read the past few days, I've come to the conclusion that the best choice for collation would be utf8_unicode_ci. Is this correct?
Now on to the PHP: what would I set as encoding? UTF-8? I need to be completely sure that not a single character shows up the way it shouldn't. Content will be submitted through forms, so the output has to be the same as the input.
I hope someone can answer my questions and help clarify this for me. Thanks in advance.
You need UTF-8 encoding to store your data. Collation, however, is what is used to sort and compare strings. Unfortunately, there is no universal collation, and there cannot be one, because the sorting rules of different languages contradict each other.
For example, in Czech 'ch' sorts after 'h', unlike in most other Latin-script languages.
Yes, utf8_unicode_ci is a sane choice when you don't know in advance the language. As for PHP I'll just link to some answers I wrote in the past:
How to best configure PHP to handle a UTF-8 website
Croatian diacritic signs in MySQL db (utf-8)
Am I correctly supporting UTF-8 in my PHP apps?
One additional piece of advice: make sure your text editor saves all files as UTF-8 (without a BOM, if you have this option). In short, keep everything UTF-8 from the very beginning and you should be safe.
I am developing an Arabic web site, and I use AJAX to save some text in my database. The AJAX works fine for me. My problem is that when I save the data in my database and try to print it on my screen, it comes back as weird text. I have used the PHP function mb_detect_encoding to determine how the database deals with the text. The function returned UTF-8.
So I used iconv("windows-1256","UTF-8",$row["text"]) to print the text on the screen, but it still returns this weird text. Please give me a hand.
Thanks
please take a look at this thread (and use the search before posting a question).
in your case, i think you've forgotten to set the correct charset for your database connection (using a SET NAMES statement or mysql_set_charset()) - but that's hard to say.
this is a quote from chazomaticus, who has given a perfect answer in the linked thread, listing all the points you have to take care of:
Storage:
Specify utf8_unicode_ci (or equivalent) collation on all tables and text columns in your database. This makes MySQL physically store and retrieve values natively in UTF-8.
Retrieval:
In PHP, in whatever DB wrapper you use, you'll need to set the connection charset to utf8. This way, MySQL does no conversion from its native UTF-8 when it hands data off to PHP.
Note that if you don't use a DB wrapper, you'll probably have to issue a query to tell MySQL to give you results in UTF-8: SET NAMES 'utf8' (as soon as you connect).
Delivery:
You've got to tell PHP to deliver the proper headers to the client, so text will be interpreted as UTF-8. In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type header yourself, which is just more work but has the same effect.
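As a rough sketch of the Retrieval and Delivery points, assuming PDO with the MySQL driver: the helper names utf8_dsn, connect_utf8 and send_utf8_header are made up for this example, and utf8mb4 is used as an assumption instead of the utf8 named above because it is MySQL's full four-byte UTF-8. Only the DSN builder is actually run here; the other two need a live database and a web request.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that pins the connection charset,
// so no separate SET NAMES query is needed after connecting.
function utf8_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Hypothetical helper: opens the connection. Not called in this sketch
// because it needs a reachable MySQL server and real credentials.
function connect_utf8(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(utf8_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Hypothetical helper: declares the response encoding; it has to run
// before any output is sent to the client.
function send_utf8_header(): void
{
    header('Content-Type: text/html; charset=UTF-8');
}

echo utf8_dsn('localhost', 'app'), "\n"; // mysql:host=localhost;dbname=app;charset=utf8mb4
```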
Submission:
You want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is to add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
Note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
Although, on that front, you'll still want to verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously.
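A minimal sketch of that validation step, assuming mbstring is available. The helper name require_utf8 is invented for this example; what to do with a rejected value (reject the request, log it, ask the user again) is left open.

```php
<?php
// Hypothetical helper: returns the input unchanged if it is valid UTF-8,
// or null so the caller can reject it.
function require_utf8(string $input): ?string
{
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

$good = require_utf8("caf\xC3\xA9"); // "café" as UTF-8 bytes
$bad  = require_utf8("caf\xE9");     // "café" as ISO-8859-1: a lone 0xE9 is not valid UTF-8

var_dump($good !== null); // bool(true)
var_dump($bad !== null);  // bool(false)
```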
Processing:
This is, unfortunately, the hard part. You need to make sure that every time you process a UTF-8 string, you do so safely. Easiest way to do this is by making extensive use of PHP's mbstring extension.
PHP's string operations are NOT by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
Also, I feel like this should be said somewhere, even though it may seem obvious: every PHP or HTML file you'll be serving should be encoded in valid UTF-8.
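To illustrate the Processing point, here is a small sketch comparing byte-oriented functions with their mbstring counterparts on the word "Über" (written with byte escapes so the example does not depend on how this file is saved). It assumes mbstring is loaded.

```php
<?php
// "Über" in UTF-8: the "Ü" takes two bytes (0xC3 0x9C), the rest one byte each.
$text = "\xC3\x9Cber";

echo strlen($text), "\n";                              // 5 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n";                  // 4 (characters)

// substr() cuts in the middle of the "Ü" and leaves an invalid fragment.
echo bin2hex(substr($text, 0, 1)), "\n";               // c3
// mb_substr() returns the whole first character.
echo bin2hex(mb_substr($text, 0, 1, 'UTF-8')), "\n";   // c39c

// Case mapping also needs to know the encoding: "Ü" becomes "ü" (0xC3 0xBC).
echo bin2hex(mb_strtolower($text, 'UTF-8')), "\n";     // c3bc626572
```

Passing 'UTF-8' explicitly on each call avoids depending on whatever mb_internal_encoding() happens to be set to.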
note that you don't need to use utf-8 - the important part is to use the same charset everywhere, independent of what charset that might be. but if you need to change things anyway, use utf-8.
I recommend changing your web pages to UTF-8.
Ideally, you should use the same encoding (UTF-8?) in your webpages, database, and JavaScript/AJAX. Many people forget to set charset for AJAX requests/responses, which gives you mangled data in some browsers (cough cough).
Thank you guys for your support, and sorry oezi for the confusion. I really did search and didn't find my answer. However, it works fine for me now. I am going to explain what I did to make it work, so anybody else can benefit from it:
- I set my tables' collation to utf8_unicode_ci.
- To submit the data, I used AJAX with the default charset, UTF-8.
- When I get the data from my DB, I use the iconv function as follows: iconv("UTF-8","windows-1256",$row["text"]), and it works.
I hope that's clear.
After answering "Zend_Cache: After loading cached data, character encoding seems messed up":
I use mb_internal_encoding("UTF-8") to change PHP's internal encoding, which is originally ISO-8859-1.
Otherwise I would need to change the encoding of every non-English input value; with this function I force PHP to convert every value to UTF-8, as you can see in the question linked above.
I am caching Arabic text in files using Zend_Cache, and I wasn't able to do it without that function.
I need to know: how bad is it to use mb_internal_encoding("UTF-8")?
I have adopted this function in every project I work on; all of them use non-English characters.
About 2 years ago I made the mistake of starting a large website using iso-8859-1. I now am having issues with some characters, especially when sending data to the server using ajax. Because of this, I would like to switch to using UTF-8.
What issues do you see coming from this? I know I would have to search the site to look for characters that need to be changed from a ? to their real characters. But, are there any other risks in doing this? Has anyone done this before?
The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:
Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.
Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively — UTF-16 is next most common — and check that it's configured to use UTF-8 on its output to the web server.
Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?
There are at least three different places you can declare the charset for a web document. Be sure you change them all:
the HTTP Content-Type header
the <meta http-equiv="Content-Type"> tag in your documents' <head>
the <?xml> tag at the top of the document, if using XHTML Strict
All this comes from my experience years ago, when I traced some Unicode data through a moderately complex N-tier app and found conversion chains like:
Latin-1 → UTF-8 → Latin-1 → UTF-8
So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset common with Latin-1.
The biggest reason for those odd conversion chains was due to immature Unicode support in the tooling at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
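If you would rather do the conversion from PHP than from the iconv command line, the core of such a script could look like the sketch below. The helper name latin1_to_utf8_once is made up, and the "already valid UTF-8, so skip it" check is only a heuristic: a Latin-1 file whose bytes happen to form valid UTF-8 would be left alone. Its purpose is to avoid the double conversions shown in the chain above if the script is run twice.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8, but leaves input
// that is already valid UTF-8 (including plain ASCII) untouched.
function latin1_to_utf8_once(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // converting again would double-encode
    }
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}

$latin1 = "caf\xE9";                    // "café" in ISO-8859-1
$once   = latin1_to_utf8_once($latin1); // converted
$twice  = latin1_to_utf8_once($once);   // second pass is a no-op

echo bin2hex($once), "\n";              // 636166c3a9
var_dump($once === $twice);             // bool(true)
```

In a real migration you would wrap this in a loop over your files (file_get_contents / file_put_contents) and work on a copy of the tree first.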
Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.
Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).
IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:
PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
The most common problem is treating a UTF-8 string as ISO-8859-1 and vice versa - you may need to add documentation to your functions stating what kind of encoding they expect, to make these sorts of errors less likely. A variable naming convention for your strings (to indicate what encoding they use) may also help.
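A quick sketch of the strstr() point, with the strings written as byte escapes. It assumes both strings really are valid UTF-8 and that mbstring is available for the last line. The search itself is safe at the byte level; what differs is the unit in which positions are reported.

```php
<?php
// "naïve café" in UTF-8: "ï" is 0xC3 0xAF, "é" is 0xC3 0xA9.
$haystack = "na\xC3\xAFve caf\xC3\xA9";
$needle   = "caf\xC3\xA9";

// Byte-oriented search still finds the right substring.
var_dump(strstr($haystack, $needle) === $needle);   // bool(true)

// But offsets differ: strpos() counts bytes, mb_strpos() counts characters.
echo strpos($haystack, $needle), "\n";              // 7
echo mb_strpos($haystack, $needle, 0, 'UTF-8'), "\n"; // 6
```

So byte offsets from strpos() should only be fed to other byte-oriented functions, and character offsets only to the mb_ ones.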
My next web application project will make extensive use of Unicode. I usually use PHP and CodeIgniter; however, Unicode is not one of PHP's strong points.
Is there a PHP tool out there that can help me get Unicode working well in PHP?
Or should I take the opportunity to look into alternatives such as Python?
PHP can handle unicode fine once you make sure to encode and decode on entry and exit. If you are storing in a database, ensure that the language encodings and charset mappings match up between the html pages, web server, your editor, and the database.
If the whole application uses UTF-8 everywhere, decoding is not necessary. The only time you need to decode is when you are outputting data in another charset that isn't on the web. When outputting html, you can use
htmlentities($var, ENT_QUOTES, 'UTF-8');
to get the correct output. Called without the charset argument, htmlentities() assumed ISO-8859-1 before PHP 5.4 and will mangle multibyte UTF-8 text in most cases. The same goes for the mail functions.
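Here is a small sketch of what goes wrong when the charset passed to htmlentities() does not match the data. The input is "café <b>" as UTF-8 bytes; the second call passes ISO-8859-1 on purpose to reproduce the damage.

```php
<?php
// "café <b>" with "é" as the UTF-8 bytes 0xC3 0xA9.
$var = "caf\xC3\xA9 <b>";

// Correct: the function is told the input is UTF-8.
$ok = htmlentities($var, ENT_QUOTES, 'UTF-8');

// Wrong charset: each byte of "é" is treated as a separate Latin-1 character.
$mangled = htmlentities($var, ENT_QUOTES, 'ISO-8859-1');

echo $ok, "\n";      // caf&eacute; &lt;b&gt;
echo $mangled, "\n"; // caf&Atilde;&copy; &lt;b&gt;
```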
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet is a very good resource for working in UTF-8
Tightly integrated Unicode support was planned as one of the major features of PHP 6 (which was never released).
Implementing UTF-8 in PHP 5.
Since PHP strings are byte-oriented, the only practical encoding scheme for Unicode text is UTF-8. The tricks (from php|architect magazine) are:
Present HTML pages in UTF-8
Convert PHP scripts to UTF-8
Convert the site content, back-end databases and the like to UTF-8
Ensure that no PHP functions corrupt the UTF-8 text
Check out http://www.gravitonic.com/talks/ PHP UTF 8 Cheat Sheet
PHP is mostly unaware of charsets and treats strings as byte streams. That's not much of a problem really, but you'll have to do a bit of work yourself.
The general rule of thumb is that you should use the same charset everywhere. If you use UTF-8 everywhere, then you're 99% there. Just make sure that you don't mix charsets, because then it gets really complicated. The only thing that won't work correctly with UTF-8 is string manipulation that needs to operate on a character level, e.g. strlen, substr etc. You should use UTF-8-aware versions in place of those. The multibyte-string extension gives you just that.
For a checklist of places where you need to make sure the charset is set correctly, look at:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
For more information, look at:
http://www.phpwact.org/php/i18n/utf-8