Global utf8_encode - php

Is there a way to apply utf8_encode globally? I'm trying to get the entire page to load different languages properly without having to insert utf8_encode before every variable.
echo '<p>'.utf8_encode($LANG_CALL_TO_ACTION).'</p>';

The fact that you have to use utf8_encode this frequently is probably a symptom of an architectural problem.
utf8_encode is a function that converts iso-8859-1 encoded data to UTF-8.
In a modern setup, you won't need to use it at all: The incoming data will already be UTF-8 encoded.
If it is not, for example because the data comes from a legacy ISO-8859-1 database (which is totally okay), you should serve the page in that encoding instead of UTF-8 (i.e. in this case ISO-8859-1).
It is also possible to globally convert the data if really, really necessary, but to give any advice on that, we'd need to know much more about your setup. It also is probably a bad idea to do.

You might not have to. utf8_encode() is for converting an ISO-8859-1 string into a UTF8 encoded one - it does you no good if your data is already UTF-8 and you're just having display issues.
What it seems to me you're after is making sure all of your content is displayed correctly as UTF-8, for which you just need to set the proper HTTP header.
My preferred method for doing so is this:
ini_set( 'default_charset', 'UTF-8' );
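A minimal sketch of how that setting might sit in a shared bootstrap file. The function name render_paragraph() is hypothetical, and the sketch assumes the PHP source files themselves are saved as UTF-8:

```php
<?php
// Set once, early, before any output is sent.
// PHP uses this value for the charset it adds to the Content-Type header
// and as the default encoding for htmlspecialchars().
ini_set('default_charset', 'UTF-8');

// Hypothetical helper: escape a translated string and wrap it in <p>.
// No utf8_encode() is involved; the data is assumed to be UTF-8 already.
function render_paragraph(string $text): string
{
    return '<p>' . htmlspecialchars($text, ENT_QUOTES) . '</p>';
}

echo render_paragraph('Grüße, 你好'), "\n";
```

With this in place, the per-variable utf8_encode() calls can be dropped as long as the strings really are UTF-8 at the source.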


Is it a good practice to use mb_convert_encoding function

This question is different from UTF-8 all the way through, as it asks how safe the mb_convert_encoding function is and whether using it is good practice.
Let's say that a user can upload files using the PHP API. Each filename and path gets stored in a PostgreSQL database table which has UTF-8 as its default encoding.
Sometimes a user uploads files whose names aren't UTF-8 encoded and they get imported into the database. The problem is that the characters that are not UTF-8 encoded are scrambled and do not display as they should in the table columns.
I was thinking of adding the following to the PHP code before import:
if ( ! mb_check_encoding($content, 'UTF-8')) {
    $output = mb_convert_encoding($content, 'UTF-8');
}
Does this look like a good practice and will it be displayed and converted by the user's client correctly if I return UTF-8 as the output? Is there a potential loss to the bytes by using mb_convert_encoding?
Thanks
If you're going to convert an encoding, you need to know what you're converting from. You can check whether the encoding is or isn't valid UTF-8, but if it tells you it's not valid UTF-8 then you still have no clue what it is. Omitting the $from_encoding parameter from mb_convert_encoding just makes it assume some preset encoding for that parameter, but that doesn't mean that $content actually is in that encoding.
In other words: if you don't know what encoding a string is in, you cannot meaningfully convert it to anything else either, and just trying to convert it from ¯\_(ツ)_/¯ is a crapshoot with the result being equally likely to be something useful and utter garbage.
If you encounter unknown encodings, you only have a few choices:
Reject the input value.
Test whether it's one of a handful of other expected encodings and then explicitly convert from your best guess; but that is pretty much a crapshoot as well.
Just use bin2hex or something similar on the value, essentially giving up on trying to interpret it correctly, but still leaving some semblance to the original value.
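The choices above can be sketched roughly as follows. The function name to_utf8_or_hex() and the "hex:" prefix are hypothetical, and the fallback encoding is something the caller has to know or guess; nothing here detects it:

```php
<?php
// Hypothetical helper illustrating the options listed above.
// $assumedEncoding is the caller's explicit best guess (or null for none).
function to_utf8_or_hex(string $raw, ?string $assumedEncoding = null): string
{
    // Already valid UTF-8: store as-is, no conversion needed.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Explicit conversion from a stated source encoding.
    // This is still a guess; a wrong guess yields readable-looking garbage.
    if ($assumedEncoding !== null) {
        return mb_convert_encoding($raw, 'UTF-8', $assumedEncoding);
    }

    // Give up on interpreting the bytes, but keep them recoverable.
    return 'hex:' . bin2hex($raw);
}

$latin1Name = "caf\xE9";                                     // "café" in ISO-8859-1
echo to_utf8_or_hex($latin1Name, 'ISO-8859-1'), "\n";         // converted from the stated encoding
echo to_utf8_or_hex($latin1Name), "\n";                       // hex fallback
```

Rejecting the upload outright is the remaining option; that would simply be an early return or exception where the hex fallback is.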

When is the correct time to use utf8_encode and utf8_decode?

Character encoding has always been a problem for me. I don't really get when the correct time to use it is.
All the databases I use now I set up with utf8_general_ci, as that seems to be a good 'general' start. I have since learned in the past five minutes that it is case insensitive. So that's helpful.
But my question is when to use utf8_encode and utf8_decode. As far as I can see now, if I $_POST a form from a table on my website, I need to utf8_encode() the value before I insert it into the database.
Then when I pull it out, I need to utf8_decode it. Is that the case? Or am I missing something?
utf8_encode and _decode are pretty bad misnomers. The only thing these functions do is convert between UTF-8 and ISO-8859-1 encodings. They do exactly the same thing as iconv('ISO-8859-1', 'UTF-8', $str) and iconv('UTF-8', 'ISO-8859-1', $str) respectively. There's no other magic going on which would necessitate their use.
If you receive a UTF-8 encoded string from the browser and you want to insert it as UTF-8 into the database using a database connection with the utf8 charset set, there is absolutely no use for either function anywhere in this chain. You are not interested in converting encodings at all here, and that should be the goal.
The only time you could use either function is if you need to convert from UTF-8 to ISO-8859-1 or vice versa at any point, because external data is encoded in this encoding or an external system expects data in this encoding. But even then, I'd prefer the explicit use of iconv or mb_convert_encoding, since it makes it more obvious and explicit what is going on. And in this day and age, UTF-8 should be the default go-to encoding you use throughout, so there should be very little need for such conversion.
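As a small illustration of the explicit alternatives (a sketch, assuming the iconv and mbstring extensions are loaded; note also that utf8_encode and utf8_decode are deprecated as of PHP 8.2):

```php
<?php
$latin1 = "caf\xE9"; // "café" as ISO-8859-1 bytes

// Both calls state the source and target encodings explicitly.
$viaIconv = iconv('ISO-8859-1', 'UTF-8', $latin1);
$viaMb    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// And back again, for an external system that expects ISO-8859-1.
$roundTrip = iconv('UTF-8', 'ISO-8859-1', $viaIconv);

// Converting a string that is already UTF-8 a second time corrupts it:
// each byte of the two-byte "é" is treated as its own Latin-1 character.
$doubleEncoded = mb_convert_encoding($viaMb, 'UTF-8', 'ISO-8859-1');

echo bin2hex($viaIconv), "\n";
echo bin2hex($doubleEncoded), "\n";
```

The last conversion is what happens when utf8_encode() is applied "just in case" to data that was UTF-8 all along.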
See:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App
UTF-8 all the way through
Basically, utf8_encode encodes an ISO-8859-1 string to UTF-8.
When you are working on translation from one language to another, you may have to use this function to prevent garbage characters from being shown.
For example, when you display Spanish characters, sometimes the script doesn't recognize them and displays garbage characters instead.
In that case you can use it.
For more about this, please follow this link:
http://php.net/manual/en/function.utf8-encode.php

Html special characters in email

I have written a script to read email from a mailbox.
In some emails, some of the data is converted into weird characters that break my further processing.
Those characters look something like this: http://brucejohnson.ca/HTMLCharacters13.html
Any idea how to convert them back into the original content?
If the script is giving you those characters, then you have two options: see the character as is, or see the numerical equivalent of that character (in various bases: octal, hex, etc.).
Are you sure that your script isn't trying to read an encrypted mail, and that your script works fine?
Try putting some dummy test data through the functions/script you've written to see if it produces the output you expect.
Hope this helps
You need to check the charset encoding in the email headers first.
Once you have done this, you then choose one of two methods: change the charset in the HTML, or convert the content (where possible) to the charset you're already using (probably UTF-8).
If you dynamically change the HTML charset in the header, your biggest problem is that users will need to specify the correct charset in their browser settings. For example, mine is set to UTF-8 but my emails are in ISO-8859-1, so with this method I would need to change my browser charset every time I look at the site, whereas a friend of mine has ISO-8859-1 as his normal charset and would have no problems.
If you encode the characters to UTF-8 (e.g. utf8_encode in php) you need to ensure the content isn't already in UTF-8 otherwise you may find the encode function creates other invalid characters.
The way I handle this is basically to decode the mime header of the email, then use preg_match in PHP to detect the charset being used, from there I run the encoding to UTF-8 or not.
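A rough sketch of that approach. The function name mail_body_to_utf8() is hypothetical, and it assumes the Content-Type header value has already been extracted and the body has already been transfer-decoded (base64 or quoted-printable):

```php
<?php
// Hypothetical helper: convert a decoded mail body to UTF-8 based on the
// charset declared in its Content-Type header. Returns null when the
// charset is missing or unknown and the body is not valid UTF-8.
function mail_body_to_utf8(string $contentType, string $body): ?string
{
    // Pull the charset parameter out of e.g. 'text/plain; charset="ISO-8859-1"'.
    if (!preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return mb_check_encoding($body, 'UTF-8') ? $body : null;
    }

    $charset = $m[1];

    // Already declared as UTF-8: do not encode again, only validate.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return mb_check_encoding($body, 'UTF-8') ? $body : null;
    }

    try {
        $converted = mb_convert_encoding($body, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        // PHP 8 throws for an encoding name mbstring does not know.
        return null;
    }

    // Older PHP versions return false instead of throwing.
    return $converted === false ? null : $converted;
}

var_dump(mail_body_to_utf8('text/plain; charset="ISO-8859-1"', "na\xEFve"));
```

Skipping the conversion when the declared charset is already UTF-8 is what avoids the double-encoding problem described above.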
This is a very complicated activity at times. When dealing with mail, the charset depends on the sender, so you don't know in advance which one will be used. You need to really understand the various charsets, how they are best stored (if you store them) and how they are best displayed, and then translate this to your app and target market.
Good luck with your app.
Have you checked the character encoding? It must be UTF-8. If it is Western European, change it to UTF-8.

Charset encoding problem

I am developing an Arabic web site, and I use AJAX to save some text in my database. The AJAX works fine. My problem is that when I save the data in my database and try to print it on my screen, it returns weird text. I have used the PHP function mb_detect_encoding to determine how the database deals with the text. The function returned UTF-8.
So I used iconv("windows-1256","UTF-8",$row["text"]) to print the text on the screen, but it is still returning this weird text. Please give me a hand.
Thanks
Please take a look at this thread (and use the search before posting a question).
In your case, I think you've forgotten to set the correct charset for your database connection (using a SET NAMES statement or mysql_set_charset()), but that's hard to say.
This is a quote from chazomaticus, who has given a perfect answer in the linked thread, listing all the points you have to take care of:
Storage:
Specify utf8_unicode_ci (or equivalent) collation on all tables and text columns in your database. This makes MySQL physically store and retrieve values natively in UTF-8.
Retrieval:
In PHP, in whatever DB wrapper you use, you'll need to set the connection charset to utf8. This way, MySQL does no conversion from its native UTF-8 when it hands data off to PHP.
Note that if you don't use a DB wrapper, you'll probably have to issue a query to tell MySQL to give you results in UTF-8: SET NAMES 'utf8' (as soon as you connect).
Delivery:
You've got to tell PHP to deliver the proper headers to the client, so text will be interpreted as UTF-8. In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type header yourself, which is just more work but has the same effect.
Submission:
You want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
Note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
Although, on that front, you'll still want to verify every submitted string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously.
Processing:
This is, unfortunately, the hard part. You need to make sure that every time you process a UTF-8 string, you do so safely. Easiest way to do this is by making extensive use of PHP's mbstring extension.
PHP's string operations are NOT by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
Also, I feel like this should be said somewhere, even though it may seem obvious: every PHP or HTML file you'll be serving should be encoded in valid UTF-8.
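To make the "Processing" point concrete, here is a short sketch comparing PHP's byte-oriented string functions with their mbstring counterparts on one UTF-8 string (the variable names are only for illustration; the mbstring extension is assumed to be loaded):

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" as UTF-8: 5 characters, 6 bytes

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($word);
$charLength = mb_strlen($word, 'UTF-8');

// substr() can cut a multi-byte character in half, leaving invalid UTF-8.
$byteCut = substr($word, 0, 3);
$charCut = mb_substr($word, 0, 3, 'UTF-8');

// Case conversion of non-ASCII letters needs the mbstring variant.
$upper = mb_strtoupper($word, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8'), $charCut, $upper);
```

Concatenation, by contrast, is safe with the plain operators because joining two valid UTF-8 strings always yields valid UTF-8.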
Note that you don't need to use UTF-8: the important part is to use the same charset everywhere, independent of what charset that might be. But if you need to change things anyway, use UTF-8.
I recommend changing your web pages to UTF-8.
Ideally, you should use the same encoding (UTF-8?) in your webpages, database, and JavaScript/AJAX. Many people forget to set charset for AJAX requests/responses, which gives you mangled data in some browsers (cough cough).
Thank you guys for your support, and sorry oezi for the confusion. I really did search and didn't find my answer. However, it works fine for me now. I am going to explain what I did to make it work, so anybody else can benefit from it:
- I set my tables' collation to utf8_unicode_ci.
- To submit the data, I used AJAX with the default charset UTF-8.
- When I get the data from my DB, I used the iconv function as follows:
iconv("UTF-8","windows-1256",$row["text"]), and it works.
I hope that is clear.

php mysql character set: storing html of international content

I'm completely confused by what I've read about character sets. I'm developing an interface to store French text formatted in HTML inside a MySQL database.
What I understood was that the safe way to have all French special characters displayed properly would be to store them as utf8, so I've created a MySQL database with utf8 specified for the database and each table.
I can see through phpMyAdmin that the characters are stored exactly the way they are supposed to be. But outputting these characters via PHP gives me erratic results: accented characters are replaced by meaningless characters. Why is that?
Do I have to utf8_encode or utf8_decode them? Note: the HTML page character encoding is set to utf8.
More generally, what is the safe way to store this data? Should I combine htmlentities, addslashes, and utf8_encode when saving, and stripslashes, html_entity_decode and utf8_decode when I output?
MySQL performs character set conversions on the fly to something called the connection charset. You can specify this charset using the sql statement
SET NAMES utf8
or use a specific API function such as mysql_set_charset():
mysql_set_charset("utf8", $conn);
If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:
header('Content-type: text/html;charset=utf-8');
(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)
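For newer PHP versions, where the mysql_* functions are no longer available, the same two settings might look like the sketch below using PDO. The helper names mysql_dsn() and connect_utf8() are hypothetical, and utf8mb4 is an assumption on my part (it is MySQL's full 4-byte UTF-8 charset; the utf8 charset used above also works for French text):

```php
<?php
// Hypothetical helper: build a PDO DSN that sets the connection charset,
// which has the same effect as issuing SET NAMES right after connecting.
function mysql_dsn(string $host, string $database, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Hypothetical helper: open the connection (not called in this sketch).
// With mysqli the equivalent step would be $mysqli->set_charset('utf8mb4').
function connect_utf8(string $host, string $database, string $user, string $password): PDO
{
    return new PDO(mysql_dsn($host, $database), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Tell the browser which encoding the page uses (only possible before output).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo mysql_dsn('localhost', 'shop'), "\n";
```

The host and database names in the last line are placeholders.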
In most cases the connection charset and web charset are the only things that you need to keep track of, so if it still doesn't work there's probably something else you're doing wrong. Try experimenting with it a bit; it usually takes a while to fully understand.
I strongly recommend reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky, to understand what you are doing and why.
It is useful to consider the PHP-generated front end and the MySQL backend separate components. MySQL should not have to worry about display logic, nor should PHP assume that the backend does any sort of preprocessing on the data.
My advice would be to store the data as plain characters using utf8 encoding, and escape any dangerous characters with MySQL's methods.
PHP then reads the utf8-encoded data from the database, processes it (with htmlentities(), most often), and displays it via whichever template you choose to use.
Emil H. correctly suggested using
SET NAMES utf8
which should be the first thing you call after making a MySQL connection. This makes MySQL treat all input and output as utf8.
Note that if you have to use the utf8_encode or utf8_decode functions, you are not setting the HTML character encoding correctly. It is easiest to require that every component of your system uses utf8, since that way you should never have to do manual encoding/decoding, which can cause hard-to-track issues later on.
In addition to what Emil H said, you also need this in your page's head tag:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
