Client / Server / DB. Should i care about other charsets than utf-8? - php

Is UTF-8/Unicode used by most clients?
And if not, should I care? The user could change the charset in the browser settings or update the software.
I am working with MySQL and PHP (and others). My database and tables use Unicode because they will contain usernames and text in different languages.
In PHP I work with:
-multibyte string functions
-regexp (u modifier and Unicode letters \p{L})
-an is_string_utf8 function, so everything else will be rejected
The u modifier wants valid UTF-8, so the input has to be UTF-8 (or not?)
I also use prepared statements; those and my is_string_utf8 function are supposed to prevent multi-byte attacks.
Does it work? Yes.
But if the user has a charset other than Unicode in their browser, it won't work properly: is_string_utf8 will reject most of the submitted data.
So, my questions: Should I care about ISO and the other charsets?
Isn't UTF-8 the standard by now?
Could I use mb_convert_encoding to convert all charsets into UTF-8, or is it more complicated than that?
Is converting the charset still secure?
Thank you very much in advance.

The browser will use the character set encoding that your website specifies in the pages it serves to the clients. I don't know how the user can set a character set in the browser. According to http://w3techs.com/technologies/details/en-utf8/all/all around 85% of websites serve pages encoded with UTF-8. Since UTF-8 can encode any unicode character you'll be just fine by having all data between your site and your users encoded in UTF-8.

You need to set the encoding of the data you're receiving from your client and not leave it to chance.
HTML forms should set accept-charset attribute to set the character set encoding:
<form method="post" action="/your/url/" accept-charset="UTF-8">
See "UTF-8 all the way through" for further information about ensuring UTF-8 is saved and served correctly.

Are you running some kind of service? Then simply mandate that everyone use utf8 (utf8mb4 in MySQL and UTF-8 outside MySQL).
Note that I said utf8mb4 -- this is because of Chinese, Emoji, and several other things not covered by MySQL's utf8.
When generating html output, be sure to have the meta tag specifying UTF8.
I have seen a lot of questions come through this and other forums; most use utf8; only a few deal with other character sets. I suggest that the other character sets could (and should) be relegated to the dust bin as antiquated and no longer of much use. (Remember EBCDIC?)
It is good that you are validating the client's text. However, text containing non-utf8 bytes will be truncated at the first invalid byte (or rejected with an error in strict SQL mode) when stored in a column with CHARACTER SET utf8 (or utf8mb4).
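As a minimal sketch of the validation discussed above: the question's is_string_utf8 function is not shown, so the helper name below is an assumption, built on PHP's mb_check_encoding(). It also illustrates why the u modifier needs valid UTF-8 input.

```php
<?php
// Hypothetical helper (the name is an assumption): true only for well-formed UTF-8.
function is_valid_utf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

$utf8Name   = "Zo\xC3\xAB"; // "Zoë" encoded as UTF-8
$latin1Name = "Zo\xEB";     // "Zoë" encoded as ISO-8859-1, not valid UTF-8

var_dump(is_valid_utf8($utf8Name));   // bool(true)
var_dump(is_valid_utf8($latin1Name)); // bool(false)

// With the u modifier, preg_match() returns false (an error), not 0, on invalid UTF-8.
var_dump(preg_match('/^\p{L}+$/u', $utf8Name));   // int(1)
var_dump(preg_match('/^\p{L}+$/u', $latin1Name)); // bool(false)
```

Rejecting invalid input up front, as here, avoids having to guess the sender's encoding, which mb_convert_encoding cannot do reliably on its own.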

Related

Html special characters in email

I have written a script to read email from a mailbox.
In some emails, some of the data is converted into weird characters that break my further processing.
Those characters look something like this: http://brucejohnson.ca/HTMLCharacters13.html
Any idea how to convert them back into the original content?
If the script is giving you those characters, then you have two options: see the character as is, or see the numerical equivalent of that character (in various bases: octal, hex, etc.).
Are you sure that your script isn't trying to read an encrypted mail, and that your script works fine?
Try putting some dummy test data through the functions/script you've written to see if it produces the output you expect.
Hope this helps
You need to check the charset encoding in the email headers first.
Once you have done this, you choose one of two methods: change the charset declared in the HTML to match the email, or (where possible) convert the content to the charset you're already using (probably UTF-8).
If you dynamically change the HTML charset in the header, your biggest problem is that users will need to specify the correct charset in their browser settings. For example, mine is set to UTF-8 but my emails are in ISO-8859-1, so with this method I would need to change my browser charset every time I look at the site, whereas a friend of mine who has ISO-8859-1 as his normal charset would have no problems.
If you encode the characters to UTF-8 (e.g. utf8_encode in PHP), you need to ensure the content isn't already in UTF-8, otherwise the encode function may create other invalid characters.
The way I handle this is to decode the MIME header of the email, then use preg_match in PHP to detect the charset being used; from there I either run the encoding to UTF-8 or not.
Dealing with mail and its various charsets can be very complicated. You don't know in advance what charset a sender will use, so you need to really understand the various charsets, how they are best stored (if you store them) and how they are best displayed, and then translate this to your app and target market.
Good luck with your app.
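The approach described in this answer (read the charset from the header with preg_match, then convert only when needed) might be sketched as follows. Both function names are hypothetical, and the sketch assumes the body has already been transfer-decoded (base64 or quoted-printable) and that the declared charset is one mbstring recognises. It uses mb_convert_encoding() rather than utf8_encode(), which only handles ISO-8859-1 and is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type header value.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset declared
}

// Hypothetical helper: return the body as UTF-8, converting only when needed.
function body_to_utf8(string $body, string $contentType): string
{
    // Assumption: treat a missing charset as UTF-8; real mail may need a different default.
    $charset = charset_from_content_type($contentType) ?? 'UTF-8';
    if ($charset === 'UTF-8' || $charset === 'UTF8') {
        return $body; // already UTF-8; converting again would double-encode it
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

$latin1Body = "Caf\xE9"; // "Café" in ISO-8859-1
$utf8Body   = body_to_utf8($latin1Body, 'text/plain; charset=iso-8859-1');
echo bin2hex($utf8Body), "\n"; // 436166c3a9, i.e. "Café" in UTF-8
```

In PHP 8, mb_convert_encoding() throws a ValueError for an encoding name it does not know, so production code would want to catch that and fall back to a default.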
Have you checked the character encoding? It must be UTF-8. If it is Western European, then change it to UTF-8.

META value charset=UTF-8 prevents UTF-8 characters showing

I've made a test program that is basically just a textarea that I can enter characters into and when I click submit the characters are written to a MySQL test table (using PHP).
The test table's collation is UTF-8.
The script works fine: if I want to write an é or ú to the database, it writes fine. But then if I add the following meta statement to the <head> area of my page:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...the characters start becoming scrambled.
My theory is that the server is imposing some encoding that works well, but when I add the UTF-8 directive it overrides this server encoding and that this UTF-8 encoding doesn't include characters such as é and ú.
But I thought that UTF-8 encoded all (bar Klingon etc) characters.
Basically my program works but I want to know why when I add the directive it doesn't.
I think I'm missing something.
Any help/teaching most appreciated.
Thanks in advance.
Firstly, PHP generally doesn't handle the Unicode character set or UTF-8 character encoding. With the exception of (careful use of) mb_... functions, it just treats strings as binary data.
Secondly, you need to tell the MySQL client library what character set / encoding you're working with. The 'SET NAMES' SQL command does the job, and different MySQL clients (mysql, mysqli etc..) provide access to it in different ways, e.g. http://www.php.net/manual/en/mysqli.set-charset.php
Your browser, and MySQL client, are probably both defaulting to latin1, and coincidentally matching. MySQL then knows to convert the latin1 binary data into UTF-8. When you set the browser charset/encoding to UTF-8, the MySQL client is interpreting that UTF-8 data as latin1, and incorrectly transcoding it.
So the solution is to set the MySQL client to a charset matching the input to PHP from the browser.
Note also that table collation isn't the same as table character set - collation refers to how strings are compared and sorted. Confusing stuff, hope this helps!
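The mis-transcoding described in this answer can be reproduced without a database. The sketch below uses mb_convert_encoding() to stand in for what the MySQL client does when it believes UTF-8 bytes are latin1; ISO-8859-1 is used as an approximation of MySQL's latin1, which agrees for the bytes shown.

```php
<?php
$utf8 = "\xC3\xA9"; // "é" as sent by a browser using UTF-8 (two bytes)

// Simulate a connection that wrongly assumes latin1 and transcodes to UTF-8 for storage.
$stored = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($stored), "\n"; // c383c2a9, which displays as "Ã©"

// Reversing the wrong step recovers the original bytes.
$repaired = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8); // bool(true)
```

Each byte of the original character was treated as a separate latin1 character, which is why one accented letter turns into two.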

Changing character encoding in MySQL, PHP scripts, HTML

So, I have been building on this system for quite some time, and it currently outputs Latin1 (ISO-8859-1) to the web browser. These are the components:
MySQL - all data is stored with the Latin1 character set
PHP - All PHP text files are stored on disk with Latin1 encoding
HTML - The output has the http-equiv="content-type" content="text/html; charset=iso-8859-1" meta tag
So, I'm trying to understand how the encoding of the different parts come into play in my workflow. If I open a PHP script and change its encoding within the text editor to UTF-8 and save it back to disk and reload the web browser, the text is all messed up - unless the text comes from the DB. If I change the encoding of the DB to UTF-8 and keep the PHP files in latin1 I have to use utf8_decode() for the data to display correctly. And if I change the HTML code the browser will read it incorrectly.
So yeah, I realise that if I want to "upgrade" to UTF8, I have to update all three parts of this setup for it to work correctly, but since it's a huge system with some 180k lines of PHP code and millions of posts in a lot of databases/tables, I don't want to start something like this without understanding everything correctly.
What haven't I thought about? What could mess this up beyond fixing? What are the procedures for changing the encoding of an entire MySQL installation and what's the easiest way to change the encoding of hundreds or thousands of PHP files on disk?
The META tag is luckily added dynamically, so I'll change that in one place only :)
Let me hear about your experiences with this.
It's tricky.
You have to:
change the DB and every table character set/encoding – I don't know much about MySQL, but see here
set the client encoding to UTF-8 in PHP (SET NAMES UTF8) before the first query
change the meta tag and possibly the Content-type header (note the Content-type header has precedence)
convert all the PHP files to UTF-8 w/out BOM – you can easily do that with a loop and iconv.
the trickiest of all: you have to change most of your string function calls. That means mb_strlen instead of strlen, mb_substr instead of substr and $str[index], etc.
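The file-conversion step in the list above ("a loop and iconv") could look roughly like this. The helper names are hypothetical, the source encoding is assumed to be ISO-8859-1, and the demo runs on a throwaway directory, not a real code base. Back up or commit the files before trying anything like this on real code.

```php
<?php
// Hypothetical helper: convert one ISO-8859-1 file to UTF-8 in place.
// Returns true only if the file was rewritten.
function convert_file_to_utf8(string $path): bool
{
    $bytes = file_get_contents($path);
    if ($bytes === false || mb_check_encoding($bytes, 'UTF-8')) {
        return false; // unreadable, or already valid UTF-8 (pure ASCII included)
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $bytes); // no BOM is added
    return $converted !== false && file_put_contents($path, $converted) !== false;
}

// Hypothetical helper: walk a directory tree and convert every .php file.
function convert_tree_to_utf8(string $dir): int
{
    $converted = 0;
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->isFile() && $file->getExtension() === 'php'
            && convert_file_to_utf8($file->getPathname())) {
            $converted++;
        }
    }
    return $converted;
}

// Demo on a temporary directory.
$dir = sys_get_temp_dir() . '/enc_demo_' . uniqid();
mkdir($dir);
file_put_contents("$dir/page.php", "Caf\xE9"); // "Café" as latin1 bytes
$count  = convert_tree_to_utf8($dir);
$result = file_get_contents("$dir/page.php");
echo $count, ' file(s) converted: ', bin2hex($result), "\n";
```

Skipping files that already validate as UTF-8 makes the script safe to re-run, at the cost of missing the rare latin1 file whose bytes happen to form valid UTF-8.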
Don't convert to UTF8 if you don't have to. It's not worth the trouble.
UTF8 is (becoming) the new standard, so for new projects I can recommend it.
Functions
Certain function calls don't work anymore. For latin1 it's:
echo htmlentities($string);
For UTF8 it's:
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
strlen(), substr(), etc. aren't aware of multibyte characters; they count and cut bytes, not characters.
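A short sketch of the difference between the byte-oriented and multibyte functions, using a string written as explicit UTF-8 bytes so the result does not depend on how this file is saved:

```php
<?php
$word = "\xC3\xABl\xC3\xA9"; // "ëlé" in UTF-8: 3 characters, 5 bytes

echo strlen($word), "\n";             // 5 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 3 (characters)

// substr() cuts in the middle of "ë" and leaves an invalid lone byte.
echo bin2hex(substr($word, 0, 1)), "\n";             // c3
echo bin2hex(mb_substr($word, 0, 1, 'UTF-8')), "\n"; // c3ab, the whole "ë"

// htmlentities() with the encoding stated explicitly.
echo htmlentities($word, ENT_COMPAT, 'UTF-8'), "\n"; // &euml;l&eacute;
```

Passing the encoding explicitly to the mb_ functions avoids depending on the server's mbstring configuration.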
MySQL
mysql_set_charset('UTF8') or mysql_query('SET NAMES UTF8') will convert all text coming from the database (SELECTs) to UTF8. It will also convert incoming strings (INSERT, UPDATE) from UTF8 to the encoding of the table.
So for reading from a latin1 table it's not necessary to convert the table encoding.
But certain characters are only available in unicode (like the snowman ☃, iPhone emoticons, etc) and can't be converted to latin1. (The data will be truncated)
Scripts
I try to avoid special characters in my PHP scripts / templates.
I use the &euml; notation instead of ë etc. This way it doesn't matter if the file is saved in latin1 or utf8.

Why is this the extended ascii character (â, é, etc) getting replaced with <?> characters?

Why are extended ASCII characters (â, é, etc.) being replaced with <?> characters?
I attached a pic. I am using PHP to pull the data from MySQL, and some of these locations have extended characters. I am using the Arial font.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving me those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I Changed everything in my PHP and HTML to:
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?
That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta> tag immediately after the opening <head> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Latin-1), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different than the encoding the browser reports as, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes &acirc; in my HTML and é becomes &eacute;. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.
As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.
Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values
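To see the mismatch from the PHP side, one option is to check the bytes coming back from the database before sending them. The sketch below uses a hard-coded value in place of a real query result (an assumption for illustration) and shows that latin1 bytes are not valid UTF-8, which is why a browser told "charset=utf-8" displays replacement characters for them.

```php
<?php
// Stand-in for a value fetched over a latin1 connection: "Brûlé Lake" as latin1 bytes.
$fromDb = "Br\xFBl\xE9 Lake";

// Not valid UTF-8, so a UTF-8 page shows a replacement character for each accented letter.
var_dump(mb_check_encoding($fromDb, 'UTF-8')); // bool(false)

// Converting from the encoding the data is really in produces valid UTF-8.
$fixed = mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-1');
var_dump(mb_check_encoding($fixed, 'UTF-8')); // bool(true)
echo bin2hex($fixed), "\n"; // 4272c3bb6cc3a9204c616b65
```

Converting in PHP like this is only a diagnostic; the lasting fix is to make the connection charset, the stored data and the Content-Type header agree.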
There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8, AFAIK, it will insert the same data but now believe it's UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.
These special characters generally appear because of the extended characters. If we provide a meta tag with charset=utf-8 we can eliminate them by adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your meta tags

php mysql character set: storing html of international content

I'm completely confused by what I've read about character sets. I'm developing an interface to store French text formatted in HTML inside a MySQL database.
What I understood was that the safe way to have all French special characters displayed properly would be to store them as utf8, so I've created a MySQL database with utf8 specified for the database and each table.
I can see through phpMyAdmin that the characters are stored exactly the way they are supposed to be. But outputting these characters via PHP gives me erratic results: accented characters are replaced by meaningless characters. Why is that?
Do I have to utf8_encode or utf8_decode them? Note: the HTML page character encoding is set to utf8.
More generally, what is the safe way to store this data? Should I combine htmlentities, addslashes, and utf8_encode when saving, and stripslashes, html_entity_decode and utf8_decode when I output?
MySQL performs character set conversions on the fly to something called the connection charset. You can specify this charset using the sql statement
SET NAMES utf8
or use a specific API function such as mysql_set_charset():
mysql_set_charset("utf8", $conn);
If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
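The mysql_* functions shown above were removed in PHP 7, so on a current installation the same idea would be expressed with mysqli or PDO. A minimal sketch with mysqli follows; the function name and the host, credentials and database name in the usage comment are placeholders, and utf8mb4 is chosen here instead of utf8 so that 4-byte characters such as emoji are covered.

```php
<?php
// Hypothetical helper: open a connection whose connection charset is utf8mb4.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    // Throw exceptions on connection or query errors instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $conn = new mysqli($host, $user, $pass, $db);
    // Sets the connection charset, like SET NAMES, and also informs the client library.
    $conn->set_charset('utf8mb4');
    return $conn;
}

// Usage (placeholder values, requires a running MySQL server):
// $conn = open_utf8_connection('localhost', 'app_user', 'secret', 'app_db');
// header('Content-Type: text/html; charset=utf-8');
```

Doing this in one helper means every part of the application gets the same connection charset without repeating the call.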
You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:
header('Content-type: text/html;charset=utf-8');
(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)
In most cases the connection charset and web charset are the only things that you need to keep track of, so if it still doesn't work there's probably something else you're doing wrong. Try experimenting with it a bit; it usually takes a while to fully understand.
I strongly recommend reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky, to understand what you are doing and why.
It is useful to consider the PHP-generated front end and the MySQL backend separate components. MySQL should not have to worry about display logic, nor should PHP assume that the backend does any sort of preprocessing on the data.
My advice would be to store the data in plain characters using utf8 encoding, and escape any dangerous characters with MySQLs methods.
PHP then reads the utf8 encoded data from database, processes them (with htmlentities(), most often), and displays it via whichever template you choose to use.
Emil H. correctly suggested using
SET NAMES utf8
which should be the first thing you call after making a MySQL connection. This makes MySQL treat all input and output on that connection as utf8.
Note that if you have to use utf8_encode or utf8_decode functions, you are not setting the html character encoding correctly. It is easiest to require that every component of your system uses utf8, since that way you should never have to do manual encoding/decoding, which can cause hard to track issues later on.
In addition to what Emil H said, you also need this in your page head tag:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
