MySQL: Different charsets for different text contents, does it worth?

MySQL: Different charsets for different text contents, does it worth? - php

I have my database with utf8mb4 in all tables and all char/varchar/text columns. All is working fine but I was wondering if I really need it for all columns. I mean, I have columns that will contain user text that require utf8mb4 since the user can type in any language, insert emoticons, and so on. However I have different columns that will contain other kind of strings like user access tokens, country codes, user nicknames that does not contain strange characters, and so on.
Does it worth to change the charset of these columns to something like ascii or latin1? It would improve database space, efficiency? My feel is that set a charset like utf84mb for something that will never contain unicode characters is a waste of 'something'... but I really do not know how this is managed internally by MySQL.
In the other side I am connecting to this database from php and setting the connection charset to uft8mb4, so I suppose that all non utf8 columns will be converted automatically. I suppose is not a problem as utf8 is superset of ascii or latin1.
Any tips? pros and contras? Thanks!

The short answer is to make all your columns and tables defaulting to the same thing, UTF-8.
The long answer is because of the way UTF-8 is encoded, where ASCII will map 1:1 with UTF-8 and not incur any additional storage overhead like you might experience with UTF-16 or UTF-32, it's not a big deal. If you're storing non-ASCII characters it will take more space, but if you're storing those, you'll need the support anyway.
Having mixed character sets in your tables is just asking for trouble. The only exception is when defining BINARY or BLOB type columns that are not UTF-8 but instead binary.
Even the documentation makes it clear the only place this is an issue is with CHAR columns rather than VARCHAR, but it's not really a good idea to use CHAR columns in the first place.

ASCII is a strict subset of UTF-8, so there is exactly zero gain in space efficiency if you have nothing that uses special characters stored in UTF-8. There is a marginal improvement in space efficiency if you use latin-1 instead of UTF-8 for storing latin-derived text (special characters that UTF-8 uses 2 bytes for can be stored with just one byte in latin-1), but you gain a lot of headaches on the way, and you lose compatibility with wider character sets.
For example, ñ is stored as 0xC3 0xB1 in UTF-8, whereas latin-1 stores it as 0xF1. On the other hand, a is 0x61 in both encodings. The clever guys that invented UTF8 did it this way. You save a single byte, only for special characters.
TL;DR Use UTF-8 for everything. If you have to ask, you don't need anything else.

Related

Charsets and Databases

My boss likes to use n-dashes. They always cause problems with encoding and I cannot work out why.
I store my TEXT field in a database under the charset: utf8_general_ci.
I have the following tags under my <head> on my webpage:
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I pull the information from my database with the following set:
mysql_set_charset('UTF8',$connection);
(I know MYSQL is depreciated)
But when I get information from the database, I end up with this:
Ã¢Â€Â“ Europe
If I take this string and run it through utf8_decode, I get this:
â�?�? Europe
I even tried running it thorugh utf8_encode, and I got this:
ÃÂ¢Ãâ¬Ãâ Europe
Can someone explain to me why this is happening? I dont understand. I even ran the string through mb_detect_encoding and It said the string was utf8. So why is not printing correctly?
The solution (or not really a solution, because it ruins the rest of the website) is to remove the mysql_set_encoding line, and use utf8_decode. Then it prints out fine. BUT WHY!?

You have to remember that computers handle all forms of data as nothing more than sequences of 1s and 0s. In order to turn those 1s and 0s into something meaningful, the computer must somehow be told how those bits should be interpreted.
When it comes to a textual string, such information regarding its bits' interpretation is known as its character encoding. For example, the bit sequence 111000101000000010010011, which for brevity I will express in hexadecimal notation as 0xe28093, is interpreted under the UTF-8 character encoding to be your boss's much-loved U+2013 (EN-DASH); however that same sequence of bits could mean absolutely anything under a different encoding: indeed, under the ISO-8859-1 encoding (for example), it represents a sequence of three characters: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>) and U+0093 (SET TRANSMIT STATE).
Unfortunately, in their infinite wisdom, PHP's developers decided not to keep track of the encoding under which your string variables are stored—that is left to you, the application developer. Worse still, many PHP functions make arbitrary assumptions about the encoding of your variables, and they happily go ahead manipulating your bits without any thought of the consequences.
So, when you call utf8_decode on a string: it takes whatever bits you provide, works out what characters they happen to represent in UTF-8, and then returns to you those same characters encoded in ISO-8859-1. It's entirely possible to come up with an input sequence that, when passed to this function, produces absolutely any given result; indeed, if you provide as input 0xc3a2c280c293 (which happens to be the UTF-8 encoding of the three characters mentioned above), it will produce a result of 0xe28093—the UTF-8 encoding of an "en dash"!
Such double encoding (i.e. UTF-8 encoded, treated as ISO-8859-1 and transcoded to UTF-8) appears to be what you're retrieving from MySQL when you do not call mysql_set_charset (in such circumstances, MySQL transcodes results to whatever character set the client specifies upon connection—the standard drivers use latin1 unless you override their default configuration). In order for a result that MySQL transcodes to latin1 to produce such double encoded UTF-8, the value that is actually stored in your column must have been triple encoded (i.e. UTF-8 encoded, treated as ISO-8859-1, transcoded to UTF-8, then treated as latin1 again)!
You need to fix the data that is stored in your database:
Identify exactly how the incumbent data has actually been encoded. Some values may well be triple-encoded as described above, but others (perhaps that predate particular changes to your application code; or that were inserted/updated from a different source) may be encoded in some other way. I find SELECT HEX(myColumn) FROM myTable WHERE ... to be very useful for this purpose.
Correct the encodings of those values that are currently incorrect: e.g. UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE ...—if an entire column is misencoded, you can instead use ALTER TABLE to change it to a binary string type and then back to a character string of the correct encoding. Beware of transformations that increase the encoded length, as the result might overflow your existing column size.

Special characters in mySQL (and php) - THE BASICS

I am confused! Recently my webhotel updated php and now my old tables render special characters differently (wrongly).
Both my tables and my input/output-php-pages are set to utf-8 and since this update, also the inputs from php are treated differently; now my special characters are being utf-8-encoded as they enter the database. So since this change, when I review tables within phpMyAdmin, the old inserts have the original (non-encoded) special characters - the new posts have utf-8-encoded charcters (also special).
So what I would like to do is rewrite input and output to insert and show non-encoded characters - but I am not sure if this is possible without skipping utf-8 entirely (in php and mySQL). But is there an utf-8- way to submit non-encoded characters?
AND - perhaps more fundamentally - I need to understand what the possible downsides are. I am using Danish characters in and out and I'm not going to use any other language (for this project). So if it IS possible to insert and output non-encoded characters using utf-8 - am I then going to have unexpected/destructive issues?
I have read a lot of posts regarding php/mySQL/special characters but I haven't seen this angle on the issue yet. Hope I am not duplicating
I hope not because it has been working very nicely until the update.

Even if you are using only Danish characters, you may as well go utf8 all the way.
There are many places where the encoding needs to be stated:
The at the top of the html
The columns in the database (column CHARACTER SET defaults from table, which defaults from database)
The encoding in your PHP code.
When you CREATE TABLE, tack on DEFAULT CHARACTER SET utf8. If you have existing tables, without that, speak up; we may need to deal with them.
If you want Danish collation, the specify COLLATION utf8_danish_ci, too. Then (if I recall correctly), aa will sort after z.
(The default is utf8_general_ci, which won't do that sorting.)
Figure out what encoding you have (or can get) in your php code. If you have some text with accents in it, do this:
$hex = unpack('H*', $text);
echo implode('', $hex)
If you have utf8, å will be C3A5, for latin1 it will be E5.
Regardless of what encoding in in the tables, you must call set_charset('utf8') or set_charset('latin1') depending on what encoding is in the data in PHP. MySQL will gladly transcode between latin1 and utf8 as things are passed between PHP and MySQL. For different APIs:
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);
For much more info, see http://mysql.rjweb.org/doc.php/charcoll .

Form saves special latin characters as symbols

My PHP form is submitting special latin characters as symbols.
So, Québec turns into QuÃ©bec
My form is set to UTF-8 and my database table has latin1_swedish_ci collation.
PHP: $db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');
A bindParam: $sql->bindParam(":x", $_POST['x'],PDO::PARAM_STR);
I am new to PDO so I am not sure what the problem is. Thank you
*I am using phpMyAdmin

To expand a little bit more on the encoding problem...
Any time you see one character in a source turn into two (or more characters), you should immediately suspect an encoding issue, especially if UTF-8 is involved. Here's why. (I apologize if you already know some of this, but I hope to help some future SO'ers as well.)
All characters are stored in your computer not as characters, but as bytes. Back in the olden days, space and transmission time were much more limited than now, so people tried to save every byte possible, even down to not using a full byte to store a character. Now, because we realize that we need to communicate with the whole world, we've decided it's more important to be able to represent every character in every language. That transition hasn't always been smooth, and that's what you're running up against.
Latin-1 (in various flavors) is an encoding that always uses a single 8-bit byte for a character. Which means it can only have 256 possible characters. Plenty if you only want to write English or Swedish, but not enough to add Russian and Chinese. (background on Latin-1)
UTF-8 encodes the first half of Latin-1 in exactly the same way, which is why you see most of the characters looking the same. But it doesn't always use a single byte for a character -- it can use up to four bytes on one character. (utf-8) As you discovered, it uses 2 bytes for é. But Latin-1 doesn't know that, and is doing its best to display those two bytes.
The trick is to always specify your encoding for byte streams (like info from a file, a URL, or a database), and to make sure that encoding is correct. (Sometimes that's a pain to find out, for sure.) Most modern languages, like Java and PHP do a good job of handling all the translation issues between different encodings, as long as you've correctly specified what you're dealing with.

You've pretty much answered your own question: you're receiving UTF-8 from the form but trying to store it in a Latin-1 column. You can either change the encoding on the column in MySQL or use the iconv function to translate between the two encodings.

Change your database table and column to utf8_unicode_ci.

Make sure you are saving the file with UTF-8 encoding (this is often overlooked)
Set headers:
<?php header("Content-type: text/html; charset=utf-8"); ?>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Do I really need to switch from VARCHAR to VARBINARY for UTF-8 in Mysql & PHP?

Do I really need to switch from VARCHAR to VARBINARY and TEXT to BLOB for UTF-8 in Mysql & PHP? Or can I stick with CHAR/TEXT fields in MySQL?

Not necessarily.
MySQL's UTF-8 support is limited to only 3 byte UTF8, which includes everything upto and including the Basic Multilingual Plane. It is only if you need characters which are in the 4 byte range that you need to use BLOB storage; this is rare, but not totally uncommon. See the Wikipedia article for a breakdown of what you'll be missing, and decide if there's anything there that is a must have.

Maybe. As jason pointed out and I failed to notice, MySQL UTF-8 does only map the Basic Multilingual Plane. The manual does point out however, that "They [utf8 and ucs2] are sufficient for almost all characters in major languages" So, it is probably safe but you might want to check out what is in the Basic Multilingual Plane just to be sure.
Orignal Answer
As long as your database is using UTF-8 you should be able to stick with VARCHAR and TEXT. (As a side note, the MySQL manual recommends using VARCHAR over CHAR with UTF-8 to save space. As this is the case, it should be safe to use VARCHAR and TEXT.)

Here's a nice link on dealing with UTF-8 in PHP. MySQL does very well with UTF-8 if you set the collation right. PHP on the other hand has lots of problems.

Of course it is safe to use VARCHAR to store UTF-8 text and no VARBINARY is needed for that.
VARCHAR is a "CHARACTER WITH VARIABLE LENGTH", which will flawlessly adapt to the number of BYTES needed to store the characters according to the CHARCODE selected.
There is also a reason why MySQL's UTF-8 support is limited to only 3 bytes. You would need to dive into the related UTF-8 docs that talk about the encoding procedure of UTF-8 to understand why that's correct.
And last but not least: if you're unsure about UTF-8, you can always opt-in to UTF-16. Yet, you'll still be using VARCHAR as it will flawlessly adapt to the correct byte-length nevertheless.

PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding

I'm migrating a db from mysql to postgresql. The mysql db's default collation is UTF8, postgres is also using UTF8, and I'm encoding the data with pg_escape_string(). For whatever reason however, I'm running into some funky errors about bad encoding:
pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client"
I've been poking around trying to figure this out, and noticed that php is doing something weird; if a string has only ascii chars in it (eg. "hello"), the encoding is ASCII. If the string contains any non ascii chars, it says the encoding is UTF8 (eg. "Hëllo").
When I use utf8_encode() on strings that are already UTF8, it kills the special chars and makes them all messed up, so.. what can I do to get this to work?
(the exact char hanging it up right now is "�", but instead of just search/replace, i'd like to find a better solution so this kinda problem doesn't happen again)

Most likely, the data in your MySQL database isn't UTF8. It's a pretty common scenario. MySQL at least used to not do any proper validation at all on the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not, I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load it, and thus it may fail.
You may want to feed the data through something like iconv that can be set to ignore unknown characters, or transform them to "best guess".

BTW, an ASCII string is exactly the same in UTF-8 because they share the same first 127 characters; so "Hello" in ASCII is exactly the same as "Hello" in UTF-8, there's no conversion needed.
The collation in the table may be UTF-8 but you may not be fetching information from it in the same encoding. Now if you have trouble with information you give to pg_escape_string it's probably because you're assuming content fetched from MySQL is encoded in UTF-8 while it's not. I suggest you look at this page on MySQL documentation and see the encoding of your connection; you're probably fetching from a table where the collation is UTF-8 but you're connection is something like Latin-1 (where special characters such as çéèêöà etc won't be encoded in UTF-8).

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.