Special characters in mySQL (and php) - THE BASICS

Special characters in mySQL (and php) - THE BASICS - php

I am confused! Recently my webhotel updated php and now my old tables render special characters differently (wrongly).
Both my tables and my input/output-php-pages are set to utf-8 and since this update, also the inputs from php are treated differently; now my special characters are being utf-8-encoded as they enter the database. So since this change, when I review tables within phpMyAdmin, the old inserts have the original (non-encoded) special characters - the new posts have utf-8-encoded charcters (also special).
So what I would like to do is rewrite input and output to insert and show non-encoded characters - but I am not sure if this is possible without skipping utf-8 entirely (in php and mySQL). But is there an utf-8- way to submit non-encoded characters?
AND - perhaps more fundamentally - I need to understand what the possible downsides are. I am using Danish characters in and out and I'm not going to use any other language (for this project). So if it IS possible to insert and output non-encoded characters using utf-8 - am I then going to have unexpected/destructive issues?
I have read a lot of posts regarding php/mySQL/special characters but I haven't seen this angle on the issue yet. Hope I am not duplicating
I hope not because it has been working very nicely until the update.

Even if you are using only Danish characters, you may as well go utf8 all the way.
There are many places where the encoding needs to be stated:
The at the top of the html
The columns in the database (column CHARACTER SET defaults from table, which defaults from database)
The encoding in your PHP code.
When you CREATE TABLE, tack on DEFAULT CHARACTER SET utf8. If you have existing tables, without that, speak up; we may need to deal with them.
If you want Danish collation, the specify COLLATION utf8_danish_ci, too. Then (if I recall correctly), aa will sort after z.
(The default is utf8_general_ci, which won't do that sorting.)
Figure out what encoding you have (or can get) in your php code. If you have some text with accents in it, do this:
$hex = unpack('H*', $text);
echo implode('', $hex)
If you have utf8, å will be C3A5, for latin1 it will be E5.
Regardless of what encoding in in the tables, you must call set_charset('utf8') or set_charset('latin1') depending on what encoding is in the data in PHP. MySQL will gladly transcode between latin1 and utf8 as things are passed between PHP and MySQL. For different APIs:
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
For much more info, see http://mysql.rjweb.org/doc.php/charcoll .

Related

Croatian letters enconding + Twitter Bootstrap + Helvetica Neue Pro

Twitter bootstrap won't show letter đ ć č ž š,Croatian letters.
I set charset to UTF-8 and it won't do...
I'm getting data form mysql and in database is ok, first i tought it was font Helvetica,so I get Helvetica Neue Pro LT but...not sure i load it correctly...any idea how to load...Twitter bootstap + Font Awesome,
so I tired with Arial, and it won't work!
so any help?
tnx alot

There can be several reasons.
Before troubleshooting the database, you should first try with some static content to ensure that your editor and server is correctly set up.
Try adding some html content to your php file that contains your croatian characters. If these characters come out wrong in your browser, make sure:
Your code editor saves your PHP files using UTF-8 encoding
Your webserver outputs your PHP files as UTF-8. To check this, look at the http header in your browser. There should be a line named "Content-Type" with a value of "text/html; charset=UTF-8". Look at this screenshot from Chrome to locate the http header: http://www.jesperkristensen.dk/webstandarder/doctype-chooser/chrome.png
If the static characters come out right, the next step is to troubleshoot the database.
The character set
Computers only know numbers by nature, so internally the computer thinks of letters and characters as numbers. For example, the letter a is, by default, stored as the number 97 on American computers, while b is 98. For a complete list, see http://www.asciitable.com/
Very simplified put, whenever displaying characters on the screen, the computer will use this numeric value and look up the value in a font library to find the appropriate glyph to display on screen.
The set of glyphs (characters) that the computer is searching whenever it is displaying some text is called the character set. The specific encoding rules, that define what numbers map to what glyphs in the character set, are called character encodings.
When people talk about the ASCII set they are talking about a collection of glyphs that include the English alphabet in both uppercase and lowercase, the arabic numbers (0-9) and a handful of special characters. But they may also be referring to the ASCII character encoding which specifies which numbers map to which glyphs. Again - see www.asciitable.com
Unfortunately, there are more than English latin characters and as computers proliferated throughout the world in the 60's and 70's, local character sets, fonts and encodings were invented to suit local needs. These have esoteric names like ISO-8859-5, EUC-JP or IBM860.
Attempting to read a text using a different character encoding standard than the text was encoded with would often cause headaches. English characters would work, because they are represented the same across different encodings and sets, but anything else would break. For instance, the character æ is a special Danish vowel character that has numeric value of 230 when using the ISO-8859-1 standard which was the predominant encoding standard in Denmark. However, if you saved a text file containing In Denmark, an apple is called æble. to a floppy disk and sent it to a friend in Bulgaria, his computer would assume that the text file is encoded using the ISO-8859-5 standard for cyrillic texts and it would show up as In Denmark, an apple is called цble which is wrong, because according to the ISO-8859-5 standard the numeric value of 230 maps to ц.
To compare the two character encodings, please see:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/ISO/IEC_8859-5
In the olden days, developers had to pick which character set and encoding to use for their application, because no universal solution existed, but fortunately a new character set called Unicode evolved in the 1990's.
This character set includes thousands and thousands of glyphs from around the world, enough to cover pretty much every current alphabet and language in the world. Together with Unicode, a few specifications on how to encode text was devised. The most popular today is called utf-8, which is conveniently backwards compatible with the old American ASCII 7-bit character encoding. Because of that, all valid ASCII text is also valid utf-8 text. This backward compatibility is also a curse, because it frequently leads novice developers to conclude that their software is working, when in fact it is only working with characters present in the ASCII character set.
First step - making sure your database is storing text as utf-8
Before you can display the text correctly, it must be stored correctly. Use your SQL management tool to check the character encoding for the table you are working with. If no information is present, defaults are probably inherited from the database schema or server configuration, so check that too.
If your table is NOT using utf-8 or you are unable to verify that it is using utf-8, you may run this command in your SQL tool against your table to explicitly instruct the database to store the data using utf-8:
ALTER TABLE name_of_your_table
CHARACTER SET utf8
COLLATE utf_unicode_ci;
This tells the database to store data using utf-8 encoding, making it capable of storing your special characters. Databases also uses a concept called collation which is a set of rules on how glyphs are sorted and compared. For instance ß may be interpreted as two s characters in german, while other languages might consider it a special character that comes before or after normal latin characters when sorting. Unless you have a good reason, use the utf_unicode_ci collation which is language agnostic and will usually sort your things correctly. the _ci in the name means case insensitive, meaning that when doing comparisons such as WHERE country = USA, records in lowercase will also match.
Second step - the webserver and the database needs to speak utf-8 together
Now that your database is storing things the right way, you have to make sure your webserver and database are communicating correctly too. Again, a multitude of environment settings affect their defaults, so it's a good idea to be explicit when connecting to the database. If you are using PDO in PHP to connect, you can use the following example to connect (taken from php.net) :
<?php
$dsn = 'mysql:host=localhost;dbname=testdb';
$username = 'username';
$password = 'password';
$options = array(
PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8',
);
$dbh = new PDO($dsn, $username, $password, $options);
?>
What is important here is the options associative array. It contains "SET NAMES utf8" which is a SQL command that is run against the database whenever a connection is opened. It has two implications.
It instructs the database that any queries sent subsequently will be encoded using utf-8. That way the database will understand non-ascii characters coming from the webserver.
It instructs the database that any responses returned from the database to the web server, will be assumed, by the web server, to be encoded using utf-8 and treated as such.
With a database that stores your data using utf-8 encoding and a web server that connects and transfers query results using utf-8 encoding, you should be ready to display your croatian characters on your website.

List of 'messed up characters' in utf8

one of my clients has a website which has been totally messed up by the hosting companie forcing a characterset on the complete database. We've had troubles before with character sets but now it's just straight forward a drama!
So far I've added the charset=utf-8 to the page content type and set the charset for the mysql connection to utf8. And now it's time to replace all characters. So far what I've found is:
Ã¶ = ö
Ã« = ë
Ã© = é
The data inside the database is being updated like so:
UPDATE table SET `fieldname` = REPLACE(`fieldname`, 'Ã¶', 'ö');
Now I just need to find a complete list of alle characters that are messed up. I tried a MySQL query searching for field LIKE '%Ã%' but this returns me all records inside the database.
Google also just displays a couple of characters (mostly the 3 above) in some topics of other people that have had troubles, however it seems there's nowhere a complete list of these characters (or at least the most common) which I can use to find and replace all data for my client.
If anyone perhaps knows such location or is able to complete my list I will, in return, create a page containing these characters to help others (unless there's a list already which I'm not aware of somewhere ofcourse).
// EDIT:
it would be for the most common european characters such as é è ë, á à ä, ö ó ò, ï, ü and perhaps the ringel-S (German double S). Not so much for the spaning signs like ñ or ã, but if they are in a list somewhere that would be much appreciated aswel.
// EDIT 2:
I updated the MySQL database and tables using the 2 ALTER queries from the 1st part of this article: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet. I DID NOT make use of the mb_ functions so far and didn't do any MB configuration as it seems.
The headers are all set to utf-8 in the files (I still have to check the headers for some ajax scripts tho, not sure if that's needed but it won't be harmfull doing so). And the files are all saved as UTF8 without BOM. Also PHPFreakMailer is updated by setting the charset to utf-8.
Bad enough, I'm still having these weird characters. I wasn't thinking they'd go away by theirself, but at least it was worth hoping so :-) So what's the final step I should take? Continuïng using the REPLACE query and changing all wierd characters manually?
Thanks in advance!

This is a bit crazy; what character set do you think "Ã¶" is in?
It looks like that's actually a correct UTF-8 sequence (since it's two bytes), you're just displaying it as ISO-8559-1.
Edit:
Based on your comment I think the following is going on:
I think (but really not 100% sure) that the correct UTF-8 binary sequence is stored in the database. But since the table is marked as ISO-8559-1, and you requested to automatically convert character set. So it thinks it's ISO-8559-1 (which looks like Ã¶), but then tries to convert that to UTF-8.
You should be able to verify this, if strlen('Ã¶') is 4, and not 2. If the length is indeed 2, your browser encoding somehow screws up.
To fix this, don't set the MySQL to encode the characters.
Option 2
The data could also be 'double encoded' in the table. To check this, simply also check the string length on the database. If the 'Ã¶' is 4 bytes long, this is the issue.
My advice in this case is to not try to make a big 'messed up character'-map. You should simply be able to 'utf8_decode' the string. Normally this function will output a ISO-8559-1 string, but in your case.. it should turn out to be the original valid UTF-8 string.
I hope this works!
Edit2
Ok so effectively what I believe has happened is Option 2. To put it in simple (php) terms:
$output = utf8_encode(utf8_encode('string'));
So one utf8_decode() should be enough.
Do test this before you run your migration scripts though :)

If they forced a character change, why is your database not converted? Are your tables still the old character set (see your phpMyAdmin on table information).
Is the data wrong if it shows up in your phpMyAdmin or only on your webpage? -> your names and collation should change, as well as headers and filetype (safe file as utf-8).
Or try:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
I would start replacing characters only if there are no options from within MySQL left.

Since you've tagged this question with "php", I assume you read the database and it's values with PHP? If so, please have a look at mb_convert_encoding if you no longer have control over the database.
The better solution would be to fix the inconsistency between the data and the tables characterset. Backup the database (just in case), and alter all tables and columns to UTF-8. Note: when using MySQL, it is not enough to alter the table's charset, you'll have to do this per column.

Why don't you use: ä = ä, ö = ö,...
Do htmlentities(); in php and it will convert all special characters into Entitys. I think this would be the easiest way to do it.

mysql and encoding

I moved my php application to the new server. i use mysql5 db. When i'm Updating or Inserting something to db, every " and - sign changed to ?. I use SET NAMES UTF8 and SET CHARACTER SET but it don't work. Any ideas?

SET NAMES UTF8 should be used on every page, when selecting as well as when updating or inserting.
actually this query must be used every time you connect to the database. just add it to connect code.

You need UTF-8 all the way through to make smart quotes and dashes (“”—) and other non-ASCII characters work reliably:
(1) Ensure that the browser sends you characters encoded to UTF-8. Do this by declaring the page that includes the form to be UTF-8:
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
...
(Ignore <form accept-encoding>, which doesn't work right in IE.)
(2) PHP deals with raw bytes and doesn't care what encoding they're in, but the database does care, so you have to tell it what encoding the bytes from PHP are coming in. This is what SET NAMES is doing, though mysql_set_charset may be preferable.
(3) Once the proper characters have reached the database, it'll need to store them in a Unicode encoding to make sure all characters can fit. Each column can have a different encoding, but you can use DEFAULT CHARACTER SET utf8 when you CREATE table to make all the text columns in it use UTF-8. You can also set the default character set for a database or the whole server to utf8 if you prefer.
If you have already CREATEd the tables and they a non-UTF-8 collation, you'll have to recreate or alter the tables. You can check the current collation using SHOW FULL COLUMNS FROM sometable;.
(4) Make sure you HTML-encode text you output from PHP using htmlspecialchars() and not htmlentities(), which by default will mess up non-ASCII characters.
[You can, as an alternative to (2) and (3), just use the default Latin-1 encoding for the connection and the table storage, but put UTF-8 bytes in it nonetheless. The disadvantage of this approach is that it'll look wrong to other tools looking at the database, and lower/upper case characters won't compare against each other in the expected case-insensitive way.]

My guess is you are pasting from some text editor which is transforming the " into an angled pretty quote, and transforming your - into an mdash, which is causing both to be represented as ?.
While you set your database to accept UTF8 characters, you probably did not set your webserver/PHP to accept those characters. Try playing with mbstring functions, but check to make sure you arent using the slanted quotes or dashes.

How do browsers/PHP handle characters outside the set characterset?

I'm looking into how characters are handled that are outside of the set characterset for a page.
In this case the page is set to iso-8859-1, and the previous programmer decided to escape input using htmlentities($string,ENT_COMPAT). This is then stored into Latin1 tables in Mysql.
As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed.
I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.
Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...

htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list), and leaves the rest as is. So you're right, using it for non-latin data makes no sense. Moreover, both html-encoding of database entries and using latin1 are bad ideas either, and I'd suggest to get rid of them both.
A word of warning: after removing htmlentities(), remember that you still need to escape quotes for the data you're going to insert in DB (mysql_escape_string or similar).

He could have used it as a basic safety precaution, ie. to prevent users from inserting HTML/Javascript into the input (because < and > will be escaped as well).
btw If you want to support Eastern and Western European languages I would suggest using UTF-8 as the default character encoding.

Yes
though not because Czech characters are outside of Latin1 but because they share the same places in the table. So, database take it as corresponding latin1 characters.
using htmlentities is always bad. the only proper solution to store different languages is to use UTF-8 charset.

Take note that htmlentities / htmlspecialchars have a 3rd parameter (since PHP 4.1.0) for the charset. ISO-8859-1 is the default so if you apply htmlentities without a 3rd parameter to a UTF-8 string for example, the output will be corrupted.
You can detect & convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input string match the desired charset.

php mysql character set: storing html of international content

i'm completely confused by what i've read about character sets. I'm developing an interface to store french text formatted in html inside a mysql database.
What i understood was that the safe way to have all french special characters displayed properly would be to store them as utf8. so i've created a mysql database with utf8 specified for the database and each table.
I can see through phpmyadmin that the characters are stored exactly the way it is supposed to. But outputting these characters via php gives me erratic results: accented characters are replaced by meaningless characters. Why is that ?
do i have to utf8_encode or utf8_decode them? note: the html page character encodign is set to utf8.
more generally, what is the safe way to store this data? Should i combine htmlentities, addslashes, and utf8_encode when saving, and stripslashes,html_entity_decode and utf8_decode when i output?

MySQL performs character set conversions on the fly to something called the connection charset. You can specify this charset using the sql statement
SET NAMES utf8
or use a specific API function such as mysql_set_charset():
mysql_set_charset("utf8", $conn);
If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:
header('Content-type: text/html;charset=utf-8');
(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)
In most cases the connection charset and web charset are the only things that you need to keep track of, so if it still doesn't work there's probably something else your doing wrong. Try experimenting with it a bit, it usually takes a while to fully understand.

I strongly recomend to read this article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky, to understand what are you doing and why.

It is useful to consider the PHP-generated front end and the MySQL backend separate components. MySQL should not have to worry about display logic, nor should PHP assume that the backend does any sort of preprocessing on the data.
My advice would be to store the data in plain characters using utf8 encoding, and escape any dangerous characters with MySQLs methods.
PHP then reads the utf8 encoded data from database, processes them (with htmlentities(), most often), and displays it via whichever template you choose to use.
Emil H. correctly suggested using
SET NAMES utf8
which should be the first thing you call after making a MySQL connection. This makes the MySQL treat all input and output as utf8.
Note that if you have to use utf8_encode or utf8_decode functions, you are not setting the html character encoding correctly. It is easiest to require that every component of your system uses utf8, since that way you should never have to do manual encoding/decoding, which can cause hard to track issues later on.

In adition to what Emil H said, you also need this in your page head tag:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.