UTF8 to latin1_swedish_ci - php

There are a lot of topics about latin1_swedish_ci to utf8 conversion. But what about the other way around? I've been dealing with this problem for quite a long time and haven't found a solution so far. Since I don't know what else is accessing this database, I don't want to change the character encoding of the table.
The table has a column that uses latin1_swedish_ci. Now I have to write queries in PHP. The database contains German and French names, meaning that I have characters like ö, ä, ô and so on. How can I do that?
As an example, if I want to query the name 'Bürki', then I have to write something like $name='Bürki'. Is there a proper way to convert it to latin1_swedish_ci without using string replacement for those special characters?

iconv() will convert strings from one encoding to the other.
The encodings that are of interest to you are utf-8 and iso-8859-1; the latter is equivalent to latin1.
The "swedish", "german" etc. parts of the name are collations. They only affect things like sorting and comparison; the character encoding is always the same.
PS.
then I have to write something like $name='Bürki'.
If you encode your source file as UTF-8, you can write Bürki directly. (You would then have to convert that string into iso-8859-1)
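A minimal sketch of that conversion, assuming the iconv extension is available. The variable names are only for illustration, and the "\u{00FC}" escape (PHP 7+) stands in for a literal ü so the example does not depend on how the file itself is saved:

```php
<?php
// UTF-8 "Bürki"; "\u{00FC}" is the code point for "ü".
$name = "B\u{00FC}rki";

// Convert from UTF-8 to ISO-8859-1 (latin1) before using it in a query
// on a latin1 connection. iconv() returns false if the conversion fails.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $name);

echo strlen($name), "\n";     // 6 (ü takes two bytes in UTF-8)
echo strlen($latin1), "\n";   // 5 (ü takes one byte in ISO-8859-1)
echo bin2hex($latin1), "\n";  // 42fc726b69
```

The converted string would then be passed to the query like any other value, through whatever escaping or parameter binding the database layer provides.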

I agree with Pekka. However, I would try the utf8_decode() function instead, because it is possible that iconv is not installed.
iconv is more powerful, though: it can do transliteration, for example. But for this purpose I believe utf8_decode() is enough.

Related

When is the correct time to use utf8_encode and utf8_decode?

Character encoding has always been a problem for me. I don't really get when the correct time to use it is.
All the databases I use now I set up with utf8_general_ci, as that seems to be a good 'general' start. I have since learned in the past five minutes that it is case insensitive. So that's helpful.
But my question is when to use utf8_encode and utf8_decode. As far as I can see now, if I $_POST a form from a table on my website, I need to utf8_encode() the value before I insert it into the database.
Then when I pull it out, I need to utf8_decode() it. Is that the case? Or am I missing something?
utf8_encode and _decode are pretty bad misnomers. The only thing these functions do is convert between UTF-8 and ISO-8859-1 encodings. They do exactly the same thing as iconv('ISO-8859-1', 'UTF-8', $str) and iconv('UTF-8', 'ISO-8859-1', $str) respectively. There's no other magic going on which would necessitate their use.
If you receive a UTF-8 encoded string from the browser and you want to insert it as UTF-8 into the database using a database connection with the utf8 charset set, there is absolutely no use for either function anywhere in this chain. You are not interested in converting encodings at all here, and that should be the goal.
The only time you could use either function is if you need to convert from UTF-8 to ISO-8859-1 or vice versa at any point, because external data is encoded in this encoding or an external system expects data in this encoding. But even then, I'd prefer the explicit use of iconv or mb_convert_encoding, since it makes it more obvious and explicit what is going on. And in this day and age, UTF-8 should be the default go-to encoding you use throughout, so there should be very little need for such conversion.
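To illustrate the explicit form, here is a small sketch that converts an ISO-8859-1 string to UTF-8 with both iconv() and mb_convert_encoding() and back again. It assumes the iconv and mbstring extensions are loaded; the variable names are made up for the example:

```php
<?php
// "\xFC" is the single ISO-8859-1 byte for "ü".
$latin1 = "B\xFCrki";

// Explicit conversions: source and target encodings are visible in the call.
$utf8ViaIconv = iconv('ISO-8859-1', 'UTF-8', $latin1);
$utf8ViaMb    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($utf8ViaIconv === $utf8ViaMb);  // bool(true)
echo bin2hex($utf8ViaIconv), "\n";       // 42c3bc726b69

// Converting back restores the original bytes.
$roundTrip = iconv('UTF-8', 'ISO-8859-1', $utf8ViaIconv);
var_dump($roundTrip === $latin1);        // bool(true)
```

Note that mb_convert_encoding() takes the target encoding before the source encoding, the reverse of iconv()'s argument order.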
See:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App
UTF-8 all the way through
Basically, utf8_encode encodes an ISO-8859-1 string to UTF-8.
When you are translating from one language to another, you have to use this function to prevent garbage characters from being shown.
For example, when you display Spanish characters, sometimes the script doesn't recognize them and displays garbage characters instead.
In that case you can use it.
For more about this, please follow this link:
http://php.net/manual/en/function.utf8-encode.php

How can I store every character in a MySQL database?

I want to be able to store every character possible (Chinese, Arabic, and characters like these: ☺♀☻) in a MySQL database and also be able to use them in PHP and HTML. How do I do this?
Edit: when I use the function htmlspecialchars() with the characters ☺♀☻, like this: htmlspecialchars('☺♀☻', ENT_SUBSTITUTE, 'UTF-8'); it returns some seemingly random characters. How do I solve this?
Use UTF-8 character encoding for all text/var fields in your database, as well as page encoding. Be sure to use multibyte (mb_*) forms of text functions, such as mb_substr().
Pick a character set that has the characters you want. UTF-8 is very broad and the most commonly used.
Storing the characters is not so much a problem since it's all just binary data. If you also want the text to be searchable then picking the right collation is useful. utf8_general_ci is fine.
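As a rough illustration of why the mb_* functions matter, the sketch below compares byte-oriented and multibyte functions on the three characters from the question. It assumes the mbstring extension is loaded and uses "\u{...}" escapes (PHP 7+) so it does not depend on the file's own encoding:

```php
<?php
// The characters from the question: U+263A, U+2640, U+263B.
$text = "\u{263A}\u{2640}\u{263B}";

echo strlen($text), "\n";              // 9 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n";  // 3 (characters)

// mb_substr() cuts on character boundaries; substr() cuts on bytes
// and can leave an incomplete, invalid sequence behind.
$first  = mb_substr($text, 0, 1, 'UTF-8');
$broken = substr($text, 0, 1);
var_dump(mb_check_encoding($first, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)

// With the charset given as UTF-8, htmlspecialchars() leaves these
// characters untouched, since none of them is special in HTML.
$escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
var_dump($escaped === $text);                   // bool(true)
```

If the output still looks like random characters in the browser, one possible cause is that the page itself is not being served as UTF-8, so the bytes are correct but are interpreted in a different encoding.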

iconv() Vs. utf8_encode()

When you have a charset different from UTF-8 and you need to put the data into JSON format to migrate it to a DB, there are two methods that can be used in PHP: calling utf8_encode() and iconv(). I would like to know which one has better performance, and when it is convenient to use one or the other.
When you have a charset different from UTF-8
Nope - utf8_encode() is suitable only for converting an ISO-8859-1 string to UTF-8. Iconv provides a vast number of source and target encodings.
Re performance, I have no idea how utf8_encode() works internally and what libraries it uses, but my prediction is there won't be much of a difference - at least not on "normal" amounts of data in the bytes or kilobytes. If in doubt, do a benchmark.
I tend to use iconv() because it's clearer that there is a conversion from character set A to character set B.
Also, iconv() provides more detailed control on what to do when it encounters invalid data. Adding //IGNORE to the target character set will cause it to silently drop invalid characters. This may be helpful in certain situations.
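A short sketch of the case utf8_encode() cannot handle: input in ISO-8859-2 rather than ISO-8859-1, converted with iconv() before being passed to json_encode(). The array key and variable names are invented for the example, and it assumes the system's iconv implementation knows ISO-8859-2:

```php
<?php
// "\xE8" is "č" in ISO-8859-2. utf8_encode() would treat the same byte
// as ISO-8859-1 and produce "è" instead.
$latin2 = "\xE8";

$utf8 = iconv('ISO-8859-2', 'UTF-8', $latin2);
if ($utf8 === false) {
    // iconv() signals failure with false rather than an exception.
    throw new RuntimeException('Conversion to UTF-8 failed');
}

echo bin2hex($utf8), "\n";  // c48d (UTF-8 bytes of U+010D)

// json_encode() expects UTF-8 input.
$json = json_encode(['name' => $utf8]);
echo $json, "\n";           // {"name":"\u010d"}
```

The explicit check for false is the important part: passing a failed conversion on to json_encode() would hide the real cause of the problem.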
I recommend that you write your own function.
It will be 2-3 lines long and it will be better than struggling with locale, iconv etc. issues.
For example:
Fix Turkish Charset Issue Html / PHP (iconv?)

How do browsers/PHP handle characters outside the set characterset?

I'm looking into how characters are handled that are outside of the set characterset for a page.
In this case the page is set to iso-8859-1, and the previous programmer decided to escape input using htmlentities($string,ENT_COMPAT). This is then stored in Latin1 tables in MySQL.
As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed.
I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.
Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...
htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list), and leaves the rest as is. So you're right, using it for non-latin data makes no sense. Moreover, both html-encoding database entries and using latin1 are bad ideas, and I'd suggest getting rid of them both.
A word of warning: after removing htmlentities(), remember that you still need to escape quotes in the data you're going to insert into the DB (mysql_escape_string or similar).
He could have used it as a basic safety precaution, i.e. to prevent users from inserting HTML/JavaScript into the input (because < and > will be escaped as well).
By the way, if you want to support Eastern and Western European languages, I would suggest using UTF-8 as the default character encoding.
Yes,
though not because Czech characters are outside of Latin1, but because they share the same places in the table. So the database takes them as the corresponding latin1 characters.
Using htmlentities is always bad. The only proper solution for storing different languages is to use the UTF-8 charset.
Take note that htmlentities / htmlspecialchars have a 3rd parameter (since PHP 4.1.0) for the charset. Before PHP 5.4, ISO-8859-1 was the default, so if you apply htmlentities with ISO-8859-1 as the charset to a UTF-8 string, for example, the output will be corrupted.
You can detect and convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input string matches the desired charset.
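The effect of the charset parameter can be sketched as follows. The example passes ENT_HTML401 explicitly so the HTML 4.01 entity table is used, and writes the characters as "\u{...}" escapes (PHP 7+); the variable names are illustrative only:

```php
<?php
$utf8  = "B\u{00FC}rki";  // UTF-8 "Bürki"
$czech = "\u{010D}";      // UTF-8 "č"

// Correct charset: "ü" becomes one named entity.
$ok = htmlentities($utf8, ENT_COMPAT | ENT_HTML401, 'UTF-8');
echo $ok, "\n";   // B&uuml;rki

// Wrong charset: each of the two UTF-8 bytes of "ü" is treated as a
// separate ISO-8859-1 character, so the output is corrupted.
$bad = htmlentities($utf8, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
echo $bad, "\n";  // B&Atilde;&frac14;rki

// "č" has no named entity in the HTML 4.01 table, so it is left as is.
$same = htmlentities($czech, ENT_COMPAT | ENT_HTML401, 'UTF-8');
var_dump($same === $czech);  // bool(true)
```

This matches the observation in the question: characters with a Latin1 entity are escaped, while Czech characters pass through unchanged.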

Is PHP's json_encode guaranteed to produce ASCII string?

Well, the subject says everything. I'm using json_encode to convert some UTF8 data to JSON and I need to transfer it to some layer that is currently ASCII-only. So I wonder whether I need to make it UTF-8 aware, or can I leave it as it is.
Looking at the JSON RFC, UTF8 is also a valid charset in JSON output, although not recommended, i.e. some implementations can leave UTF8 data inside. The question is whether PHP's implementation dumps everything as ASCII or opts to leave something as UTF-8.
Unlike JSON support in other languages, json_encode() does not have the ability to generate anything other than ASCII.
According to the JSON article in Wikipedia, Unicode characters in strings are always
double-quoted Unicode with backslash escaping
The examples in the PHP Manual on json_encode() seem to confirm this.
So any UTF-8 character outside ASCII/ANSI should be escaped like this: \u0027 (note, as @Ignacio points out in the comments, that this is the recommended way to deal with those characters, not a required one)
However, I suppose json_decode() will convert the characters back to their byte values? You may get in trouble there.
If you need to be sure, take a look at iconv() that could convert your UTF-8 String into ASCII (dropping any unsupported characters) beforehand.
Well, json_encode returns a string. According to the PHP documentation for string:
A string is series of characters. Before PHP 6, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality.
So for the time being you do not need to worry about making it UTF-8 aware. Of course you still might want to think about this anyway, to future-proof your code.
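A quick way to check this on a given PHP version is sketched below. It relies on json_encode()'s default behaviour of escaping non-ASCII characters as \uXXXX, and on the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4) that turns that escaping off; the variable names are only for the example:

```php
<?php
$utf8 = "B\u{00FC}rki";  // UTF-8 "Bürki"

// Default: non-ASCII characters are written as \uXXXX escapes.
$default = json_encode($utf8);
echo $default, "\n";  // "B\u00fcrki"

// Check that every byte of the output is in the ASCII range.
$isAscii = preg_match('/^[\x00-\x7F]*$/', $default) === 1;
var_dump($isAscii);   // bool(true)

// With JSON_UNESCAPED_UNICODE the UTF-8 bytes are kept as they are.
$raw = json_encode($utf8, JSON_UNESCAPED_UNICODE);
var_dump($raw === '"' . $utf8 . '"');       // bool(true)

// json_decode() turns the escapes back into UTF-8.
var_dump(json_decode($default) === $utf8);  // bool(true)
```

So an ASCII-only transport layer should be safe as long as the flag is not passed, while whatever decodes the JSON on the other side will get UTF-8 back.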
