I want to be able to store every possible character (Chinese, Arabic, these kinds of characters: ☺♀☻) in a MySQL database, and also be able to use them in PHP and HTML. How do I do this?
Edit: when I use the function htmlspecialchars() with those characters, like this: htmlspecialchars('☺♀☻', ENT_SUBSTITUTE, 'UTF-8');, it returns some seemingly random characters. How do I solve this?
Use the UTF-8 character encoding for all text/varchar fields in your database, as well as for the page encoding. Be sure to use the multibyte (mb_*) forms of the string functions, such as mb_substr().
Pick a character set that has the characters you want; UTF-8 is very broad and the most commonly used.
Storing the characters is not so much a problem, since it's all just binary data. If you also want the text to be searchable, then picking the right collation matters; utf8_general_ci is fine.
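A minimal sketch of the three places the encoding has to line up: the page, the connection, and the string handling. The mysqli credentials are placeholders and are left commented out; the string functions run as-is:

```php
<?php
// Page encoding: tell the browser the output is UTF-8.
// header('Content-Type: text/html; charset=utf-8');

// Connection encoding (assuming mysqli; credentials are placeholders).
// utf8mb4 covers all of Unicode, including emoji; MySQL's plain "utf8"
// is limited to 3 bytes per character.
// $mysqli = new mysqli('localhost', 'user', 'pass', 'db');
// $mysqli->set_charset('utf8mb4');

// String handling: use the mb_* functions so characters, not bytes, are counted.
$s = '☺♀☻';
var_dump(strlen($s));                    // int(9) -- each symbol is 3 bytes in UTF-8
var_dump(mb_strlen($s, 'UTF-8'));        // int(3) -- 3 characters
var_dump(mb_substr($s, 0, 2, 'UTF-8'));  // "☺♀", not a broken byte sequence
```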
Character encoding has always been a problem for me. I don't really understand when the right time to use it is.
All the databases I set up now use utf8_general_ci, as that seems to be a good 'general' start. I have since learned, in the past five minutes, that it is case-insensitive. So that's helpful.
But my question is: when do I use utf8_encode and utf8_decode? As far as I can see now, if I $_POST a form from a table on my website, I need to utf8_encode() the value before I insert it into the database.
Then when I pull it out, I need to utf8_decode() it. Is that the case? Or am I missing something?
utf8_encode and _decode are pretty bad misnomers. The only thing these functions do is convert between UTF-8 and ISO-8859-1 encodings. They do exactly the same thing as iconv('ISO-8859-1', 'UTF-8', $str) and iconv('UTF-8', 'ISO-8859-1', $str) respectively. There's no other magic going on which would necessitate their use.
If you receive a UTF-8 encoded string from the browser and you want to insert it as UTF-8 into the database using a database connection with the utf8 charset set, there is absolutely no use for either function anywhere in this chain. You are not interested in converting encodings at all here, and that should be the goal.
The only time you could use either function is if you need to convert from UTF-8 to ISO-8859-1 or vice versa at any point, because external data is encoded in this encoding or an external system expects data in this encoding. But even then, I'd prefer the explicit use of iconv or mb_convert_encoding, since it makes it more obvious and explicit what is going on. And in this day and age, UTF-8 should be the default go-to encoding you use throughout, so there should be very little need for such conversion.
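To make the equivalence concrete, here is a minimal check (note that utf8_encode/utf8_decode are deprecated as of PHP 8.2, one more reason to prefer iconv or mb_convert_encoding):

```php
<?php
$latin1 = "K\xF6ln";  // "Köln" in ISO-8859-1 (0xF6 = ö)

// utf8_encode() is exactly iconv('ISO-8859-1', 'UTF-8', ...), nothing more:
var_dump(utf8_encode($latin1) === iconv('ISO-8859-1', 'UTF-8', $latin1)); // bool(true)

// Applying it to a string that is already UTF-8 double-encodes it:
var_dump(utf8_encode("Köln")); // "KÃ¶ln" -- the classic mojibake
```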
See:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App
UTF-8 all the way through
Basically, utf8_encode is used to encode an ISO-8859-1 string to UTF-8.
When you are converting text from one language to another, you may need this function to avoid displaying garbage characters.
For example, when you display Spanish text, the script sometimes doesn't recognize a Spanish character and shows a garbage character in its place.
That is when you can use it.
For more about this, see:
http://php.net/manual/en/function.utf8-encode.php
I have my database with utf8mb4 in all tables and all char/varchar/text columns. All is working fine, but I was wondering if I really need it for all columns. I mean, I have columns that will contain user text that requires utf8mb4, since the user can type in any language, insert emoticons, and so on. However, I have other columns that will contain other kinds of strings, like user access tokens, country codes, and user nicknames, that do not contain unusual characters.
Is it worth changing the charset of these columns to something like ascii or latin1? Would it improve database space or efficiency? My feeling is that setting a charset like utf8mb4 for something that will never contain Unicode characters is a waste of 'something'... but I really do not know how this is managed internally by MySQL.
On the other side, I am connecting to this database from PHP and setting the connection charset to utf8mb4, so I suppose all non-utf8 columns will be converted automatically. I suppose that is not a problem, as utf8 is a superset of ascii and latin1.
Any tips? Pros and cons? Thanks!
The short answer is to make all your columns and tables defaulting to the same thing, UTF-8.
The long answer is that, because of the way UTF-8 is encoded, ASCII maps 1:1 onto UTF-8 without incurring the extra storage overhead you might see with UTF-16 or UTF-32, so it's not a big deal. Storing non-ASCII characters takes more space, but if you're storing those, you need the support anyway.
Having mixed character sets in your tables is just asking for trouble. The only exception is when defining BINARY or BLOB type columns that are not UTF-8 but instead binary.
Even the documentation makes it clear that the only place this is an issue is with CHAR columns rather than VARCHAR, but it's not really a good idea to use CHAR columns in the first place.
ASCII is a strict subset of UTF-8, so there is exactly zero gain in space efficiency if you have nothing that uses special characters stored in UTF-8. There is a marginal improvement in space efficiency if you use latin-1 instead of UTF-8 for storing latin-derived text (special characters that UTF-8 uses 2 bytes for can be stored with just one byte in latin-1), but you gain a lot of headaches on the way, and you lose compatibility with wider character sets.
For example, ñ is stored as 0xC3 0xB1 in UTF-8, whereas latin1 stores it as 0xF1. On the other hand, a is 0x61 in both encodings. The clever people who invented UTF-8 designed it this way: you pay one extra byte, and only for special characters.
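These byte counts are easy to verify in PHP (strlen counts bytes, not characters):

```php
<?php
$utf8   = "\xC3\xB1";  // "ñ" in UTF-8
$latin1 = "\xF1";      // "ñ" in ISO-8859-1 / latin1

var_dump(strlen($utf8));    // int(2)
var_dump(strlen($latin1));  // int(1)

// Plain ASCII is byte-identical in both encodings, so converting it is a no-op:
var_dump(iconv('UTF-8', 'ISO-8859-1', 'abc') === 'abc');  // bool(true)
```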
TL;DR Use UTF-8 for everything. If you have to ask, you don't need anything else.
I have the following:
$html = '<div>ياں ان کي پرائيويٹ ليمٹڈ کمپنياں ہيں</div>';
But it is being stored in the MySQL database in the following format:
تو يہ اسمب
لي ميں غر
يب کو آنے
نہيں
Actually, when I retrieve the data from the MySQL database and show it on the webpage, it is displayed correctly.
But I want to know: is this the standard format for storing Unicode in the database, or should the Unicode data be stored as it is (ياں ان کي پرائيويٹ ليمٹڈ کمپنياں ہيں)?
When you store unicode in your database...
First off, your database has to be set to a UTF-8 character set, which is not the default. With MySQL, you have to set both the table AND the individual columns to a UTF-8 charset. In addition to this, you have to be sure that your connection is a UTF-8 connection; how to do that varies based on what method you use to store the text in your database.
To set your connection's char-set, if you are using Mysqli, you would do this:
$c->set_charset('utf8'); where $c is a Mysqli connection.
Still, you have to change your database charsets like I said before.
EDIT: I honestly don't think it matters much how you store it, though I store the actual Unicode characters, because that way, if a user were to input '& #1610;' into the database, it wouldn't be retrieved as a Unicode character by mistake.
EDIT: Here is a good example: if you remove the space between & and #1610; in my answer, it will be mistakenly retrieved from the server as a Unicode character, unless you actually want users to be able to create Unicode characters with a code like that.
Not a perfect example, since Stack Overflow does that on purpose and it doesn't really work like that here, but the concept is the same.
Something is wrong with the data charset; I don't know what exactly.
This is workaround. Do it before insert/update:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
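A sketch of what that workaround does, assuming the stored text contains numeric HTML entities:

```php
<?php
// Text as it might sit in the database: numeric HTML entities.
$stored = '&#1610;&#1575;&#1606;';

// Decode the entities back to real UTF-8 characters (do this before
// insert/update, or once as a one-off migration of existing rows):
$decoded = html_entity_decode($stored, ENT_COMPAT, 'UTF-8');

var_dump($decoded);                      // "يان"
var_dump(mb_strlen($decoded, 'UTF-8'));  // int(3)
```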
It looks to me like this is HTML entity encoding, a way of encoding Unicode so that it displays OK on the web page no matter what the page encoding is.
Did you try to fetch the same data using MySQL Workbench?
It seems that somewhere in your PHP code htmlentities is being used on the text -- instead of htmlspecialchars. The difference with htmlentities is that it escapes a lot of non-ASCII characters in the form you see there. Then the result of that is being stored in the database. It's not MySQL's doing.
In theory this shouldn't be necessary. It should be okay to output the plain characters if you set the character set of the page correctly. Assuming UTF-8, for example, use header('Content-Type: text/html; charset=utf-8'); or <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
This might result in gibberish (mojibake) if you view the database directly (although it will display fine on the web page) unless you also make sure the character set of the database is set correctly. That means the table columns, table, database, and connection character set should all be set to, probably, utf8mb4, with a collation like utf8mb4_general_ci (or utf8mb4_bin). In practice, getting it all working can be a bit of a nuisance. If you didn't write this code, then probably someone in your code base decided at some point to use htmlentities to convert the exotic characters to ASCII HTML entities, to make storage easier. Or sometimes people use htmlentities out of habit when the milder htmlspecialchars would be fine.
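A quick illustration of the difference between the two functions (assuming UTF-8 input, with the charset parameter set accordingly):

```php
<?php
$s = 'ñ & <b>';

// htmlspecialchars escapes only the HTML-significant characters (& < > " ').
var_dump(htmlspecialchars($s, ENT_QUOTES, 'UTF-8')); // "ñ &amp; &lt;b&gt;"

// htmlentities additionally escapes every character that has a named entity,
// which is how non-ASCII text ends up stored as &ntilde;-style codes.
var_dump(htmlentities($s, ENT_QUOTES, 'UTF-8'));     // "&ntilde; &amp; &lt;b&gt;"
```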
There are a lot of topics about latin1_swedish_ci to utf8 conversion. But what about the other way around? I've been dealing with this problem for quite a long time and I haven't found a solution so far. Since I don't know what else is accessing this database, I don't want to change the character encoding of the table.
I have a column in the table that uses the latin1_swedish_ci collation. Now I have to write queries against it in PHP. The database contains German and French names, meaning that I have characters like ö, ä, ô, and so on. How can I do that?
As an example, if I want to query the name 'Bürki', then I have to write something like $name='Bürki'. Is there a proper way to convert it to latin1 without using string replacement for those special characters?
iconv() will convert strings from one encoding to the other.
The encodings that are of interest to you are UTF-8 and ISO-8859-1; the latter is equivalent to latin1.
The 'swedish', 'german', etc. parts of the collation names affect things like sorting only; the character encoding is always the same.
PS.
then I have to write something like $name='Bürki'.
If you encode your source file as UTF-8, you can write Bürki directly. (You would then have to convert that string into iso-8859-1)
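A sketch of that conversion (the table and column names in the commented query are made up):

```php
<?php
// Typed in a UTF-8 source file:
$name = 'Bürki';                       // bytes: 42 C3 BC 72 6B 69

// Convert to ISO-8859-1 (latin1) before comparing against a latin1 column:
$latin1 = iconv('UTF-8', 'ISO-8859-1', $name);

var_dump(strlen($name));               // int(6) -- two bytes for ü in UTF-8
var_dump(strlen($latin1));             // int(5) -- one byte for ü in latin1

// The query itself would then look something like:
// $stmt = $mysqli->prepare('SELECT * FROM people WHERE name = ?');
// $stmt->bind_param('s', $latin1);
```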
I agree with Pekka; however, I would try to use the utf8_decode() function instead, because it is possible that iconv is not installed...
iconv, however, is more powerful: it can do transliteration, for example. But for this purpose I believe utf8_decode() is enough.
I'm looking into how characters are handled that fall outside the character set declared for a page.
In this case the page is set to ISO-8859-1, and the previous programmer decided to escape input using htmlentities($string, ENT_COMPAT). This is then stored in Latin1 tables in MySQL.
As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed.
I did some experiments at http://floris.workingweb.nl/experiments/characters.php and it seems that some characters inside Latin1 are escaped, but, for example, the characters in a Czech name are not.
Is this because those characters are outside of Latin1? If so, then htmlentities can be removed, since it doesn't help for anything outside of Latin1 anyway, and within Latin1 it is not needed as far as I can see now...
htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list) and leaves the rest as is. So you're right: using it for non-Latin data makes no sense. Moreover, both HTML-encoding of database entries and using latin1 are bad ideas, and I'd suggest getting rid of them both.
A word of warning: after removing htmlentities(), remember that you still need to escape quotes in the data you're going to insert into the DB (mysqli_real_escape_string or, better, prepared statements; the old mysql_* functions are removed in PHP 7).
He could have used it as a basic safety precaution, i.e. to prevent users from inserting HTML/JavaScript into the input (because < and > will be escaped as well).
By the way, if you want to support Eastern and Western European languages, I would suggest using UTF-8 as the default character encoding.
Yes,
though not because Czech characters are outside of Latin1, but because they occupy the same positions in the table, so the database takes them for the corresponding Latin1 characters.
Using htmlentities is always bad; the only proper solution for storing different languages is to use the UTF-8 charset.
Take note that htmlentities / htmlspecialchars have a third parameter (since PHP 4.1.0) for the charset. ISO-8859-1 was the default before PHP 5.4 (since then it is UTF-8), so if you apply htmlentities without a third parameter to a UTF-8 string on an old PHP, for example, the output will be corrupted.
You can detect and convert the input string with mb_detect_encoding and mb_convert_encoding to make sure it matches the desired charset.
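A sketch of that detect-and-convert step (the helper name to_utf8 is made up; detection is heuristic, so keep the candidate list as short as your inputs allow):

```php
<?php
function to_utf8(string $s): string {
    // Strict detection against a short, ordered candidate list.
    $from = mb_detect_encoding($s, ['UTF-8', 'ISO-8859-1'], true);
    if ($from === false) {
        $from = 'ISO-8859-1'; // fallback assumption for undetectable input
    }
    return $from === 'UTF-8' ? $s : mb_convert_encoding($s, 'UTF-8', $from);
}

var_dump(to_utf8("B\xFCrki"));  // latin1 input -> "Bürki" in UTF-8
var_dump(to_utf8('Bürki'));     // already UTF-8 -> unchanged
```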