Well, the subject says it all. I'm using json_encode() to convert some UTF-8 data to JSON, and I need to transfer it to a layer that is currently ASCII-only. So I wonder whether I need to make that layer UTF-8 aware, or whether I can leave it as it is.
Looking at the JSON RFC, raw UTF-8 is also a valid charset in JSON output, although not recommended, i.e. some implementations may leave UTF-8 data unescaped. The question is whether PHP's implementation dumps everything as ASCII or opts to leave something as UTF-8.
Unlike JSON support in other languages, json_encode() does not have the ability to generate anything other than ASCII.
According to the JSON article in Wikipedia, Unicode characters in strings are always
double-quoted Unicode with backslash escaping
The examples in the PHP Manual on json_encode() seem to confirm this.
So any UTF-8 character outside the ASCII range should be escaped like this: \u00e9 for é (note, as @Ignacio points out in the comments, that this is the recommended way to deal with those characters, not a required one)
However, I suppose json_decode() will convert the characters back to their byte values? You may get in trouble there.
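A quick sketch of the round trip (default json_encode() behavior, no flags, source file saved as UTF-8):

$json = json_encode(array('text' => 'é'));
echo $json;                      // {"text":"\u00e9"}

$data = json_decode($json, true);
echo $data['text'];              // é - raw UTF-8 bytes again, no longer ASCII-safe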
If you need to be sure, take a look at iconv(), which can convert your UTF-8 string into ASCII (dropping any unsupported characters) beforehand.
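A minimal sketch of that conversion (//TRANSLIT and //IGNORE are standard iconv suffixes, but the transliteration results depend on the iconv implementation in use):

$ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'résumé');
echo $ascii;                     // "resume" with glibc iconv; untransliterable characters are dropped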
Well, json_encode returns a string. According to the PHP documentation for string:
A string is a series of characters. Before PHP 6, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality.
So for the time being you do not need to worry about making it UTF-8 aware. Of course you still might want to think about this anyway, to future-proof your code.
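To make the byte-oriented behavior concrete (a small sketch, assuming the source file is saved as UTF-8 and the mbstring extension is available):

$str = 'é';                      // two bytes in UTF-8: 0xC3 0xA9
echo strlen($str);               // 2 - bytes, not characters
echo mb_strlen($str, 'UTF-8');   // 1 - characters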
Related
Character encoding has always been a problem for me. I never really know when the right time to deal with it is.
All the databases I use now I set up with utf8_general_ci, as that seems to be a good 'general' start. I have since learned in the past five minutes that it is case-insensitive. So that's helpful.
But my question is: when should I use utf8_encode() and utf8_decode()? As far as I can see now, if I $_POST a form from my website, I need to utf8_encode() the value before I insert it into the database.
Then when I pull it out, I need to utf8_decode() it. Is that the case? Or am I missing something?
utf8_encode and _decode are pretty bad misnomers. The only thing these functions do is convert between UTF-8 and ISO-8859-1 encodings. They do exactly the same thing as iconv('ISO-8859-1', 'UTF-8', $str) and iconv('UTF-8', 'ISO-8859-1', $str) respectively. There's no other magic going on which would necessitate their use.
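A sketch of that equivalence (assuming the iconv extension is available):

$latin1 = "\xFC";   // 'ü' in ISO-8859-1
var_dump(utf8_encode($latin1) === iconv('ISO-8859-1', 'UTF-8', $latin1));         // bool(true)
var_dump(utf8_decode("\xC3\xBC") === iconv('UTF-8', 'ISO-8859-1', "\xC3\xBC"));   // bool(true)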
If you receive a UTF-8 encoded string from the browser and you want to insert it as UTF-8 into the database using a database connection with the utf8 charset set, there is absolutely no use for either function anywhere in this chain. You are not interested in converting encodings at all here, and that should be the goal.
The only time you could use either function is if you need to convert from UTF-8 to ISO-8859-1 or vice versa at any point, because external data is encoded in this encoding or an external system expects data in this encoding. But even then, I'd prefer the explicit use of iconv or mb_convert_encoding, since it makes it more obvious and explicit what is going on. And in this day and age, UTF-8 should be the default go-to encoding you use throughout, so there should be very little need for such conversion.
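If you do need such a conversion, the explicit form might look like this (a sketch; $latin1Input and $utf8Input are placeholder variables, and mb_convert_encoding() requires the mbstring extension):

$utf8   = mb_convert_encoding($latin1Input, 'UTF-8', 'ISO-8859-1');   // latin1 -> UTF-8
$latin1 = mb_convert_encoding($utf8Input, 'ISO-8859-1', 'UTF-8');     // UTF-8 -> latin1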
See:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App
UTF-8 all the way through
Basically, utf8_encode() is used to encode an ISO-8859-1 string to UTF-8.
When you are working with translations from one language to another, you may have to use this function to prevent garbage characters from showing up.
For example, when you display Spanish text, the script sometimes doesn't recognize the Spanish characters and displays garbage characters instead.
That is when you can use it.
For more about this, please see:
http://php.net/manual/en/function.utf8-encode.php
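For example (a minimal sketch; the input is assumed to be ISO-8859-1, which is the only encoding utf8_encode() converts from):

$iso = "Espa\xF1ol";             // "Español" in ISO-8859-1 (0xF1 = ñ)
echo utf8_encode($iso);          // now valid UTF-8, safe to show on a UTF-8 page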
I'm trying to pull sports players from my database whose names are already stored as Unicode values. When calling json_encode(), it gives up when it hits Unicode characters in the data I've got:
$values = array('a'=>'BERDYCH, Tomáš','b'=>'FEDERER, Roger');
echo json_encode($values);
The result is:
{"a":"BERDYCH, Tom","b":"FEDERER, Roger"}
You can see 'Tomáš' was cut off at 'Tom' because json_encode() reached the Unicode characters.
I understand json_encode only handles \uxxxx style characters, but the problem is that my database of thousands of sporting competitors already contains stored Unicode values, so somehow I need to convert characters like á into \uxxxx without updating my data source.
Any ideas?
json_encode() does this when it encounters byte sequences that are not valid UTF-8.
If you are fetching data from the database, the most likely reason is that your connection is not UTF-8 encoded, and you are getting ISO-8859-1 data from your queries.
Show your database code for a suggestion how to change this.
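For example, with mysqli the connection encoding is typically set like this (a sketch; the connection details are placeholders, and PDO accepts an equivalent charset option in its DSN):

$mysqli = new mysqli('localhost', 'user', 'pass', 'db');
$mysqli->set_charset('utf8mb4');   // ask MySQL to deliver query results as UTF-8

$pdo = new PDO('mysql:host=localhost;dbname=db;charset=utf8mb4', 'user', 'pass');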
I understand json_encode only handles \uxxxx style characters
This is not true. json_encode() outputs Unicode characters encoded this way, but it doesn't expect them in the incoming data.
Your source code and/or the data coming from the database is not encoded in UTF-8. I'd guess it's one of the specialized ISO-8859 encodings, but I'm not sure. When saving your source code, make sure it's saved in UTF-8. When getting data from the database, make sure you're setting the connection to utf8.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App.
To make sure they are UTF-8, encode all values in your array:
$values = array_map('utf8_encode', $values);
If that doesn't help, use mb_detect_encoding() and mb_convert_encoding() to convert the language-specific encoding to UTF-8.
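A sketch of that fallback (mb_detect_encoding() is a heuristic and can guess wrong, so treat its result with care):

foreach ($values as $key => $value) {
    $from = mb_detect_encoding($value, array('UTF-8', 'ISO-8859-1'), true);
    if ($from !== 'UTF-8') {
        $values[$key] = mb_convert_encoding($value, 'UTF-8', $from ?: 'ISO-8859-1');
    }
}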
It's a C# question, but take a look at Converting Unicode strings to escaped ascii string for an implementation that does this.
There are a lot of topics about latin1_swedish_ci to utf8 conversion, but what about the other way around? I've been dealing with this problem for quite a long time and haven't found a solution so far. Since I don't know what else is accessing this database, I don't want to change the character encoding of the table.
I have a column in the table whose collation is latin1_swedish_ci. Now I have to write queries against it in PHP. The database contains German and French names, meaning that I have characters like ö, ä, ô and so on. How can I do that?
As an example, if I want to query the name 'Bürki', then I have to write something like $name='Bürki'. Is there a proper way to convert it to latin1_swedish_ci without using string replacement for those special characters?
iconv() will convert strings from one encoding to the other.
The encodings of interest to you are UTF-8 and ISO-8859-1; the latter is equivalent to MySQL's latin1.
The 'swedish', 'german', etc. collations affect things like sorting only; the character encoding is always the same.
PS.
then I have to write something like $name='Bürki'.
If you encode your source file as UTF-8, you can write Bürki directly. (You would then have to convert that string into ISO-8859-1 before querying.)
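A sketch of that conversion (assuming the source file is saved as UTF-8 and the column really is latin1):

$name   = 'Bürki';                                // UTF-8 in the source file
$latin1 = iconv('UTF-8', 'ISO-8859-1', $name);    // the byte sequence the latin1 column expects
// use $latin1 as the query value, ideally via a prepared statement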
I agree with Pekka; however, I would try the utf8_decode() function instead, because it is possible that iconv is not installed...
iconv is more powerful, however - it can do transliteration, for example. But for this purpose I believe utf8_decode() is enough.
I have a situation where, after several years of use, we suddenly have some JSON-encoded values that are giving our Perl script fits due to backslashes.
The issues are with accented characters like í and é. An example is Matí encoded as Mat\ud873.
It is unclear what may have changed in the environment. PHP, Perl, and MySQL are involved. The table collation is latin1_swedish_ci and this may have been changed by a co-worker screwing around.
Does this ring any bells for anyone?
The problem here is internationalization on the JavaScript end, not the collation of your DB table. If you had no such problems before, it's likely that no users were inputting international characters before, or the character set of your HTML pages was ISO-8859-1/cp1252 (which would have limited form POST data on the client end.) New users or changed HTML headers could have caused this problem to manifest itself, but the issue is really on the side of the Perl script.
JSON defines strings as double-quoted sets of characters, with Unicode escape sequences used when more than a 7-bit encoding is necessary. The first 128 ISO-8859-1 characters (the ASCII range) can be represented as-is, but any extended-ASCII/multi-byte characters will end up as \uXXXX values. For example, the character é (e-acute), which is #233 in ISO-8859-1, will show up as \u00E9 (since é is U+00E9 in Unicode), and the string "résumé" would be stored as "r\u00E9sum\u00E9".
Not knowing what your Perl script is attempting to do, all I can say is it may be experiencing difficulty when trying to de-reference the escape sequence. Perl has its own set of escape sequences, and \u mid-string actually means "make the next character upper-case", so you're probably seeing a lot of "00E9" stuff from your Perl script instead of the accented characters, or you may get parse errors depending on your script.
Since you're creating/storing the JSON from POST data in PHP, you have some options:
Convert the special characters to HTML entities (htmlentities())
Force all special characters to reduce from UTF-8 sequences (if that's what your POST data comes in) to ISO-8859-1 via utf8_decode() (you may lose data with this approach)
Scrub the resultant JSON by replacing this regex match: /\\u[0-9a-fA-F]{4}/ with "" (nothing) (you may lose data with this approach)
Double-escape the resultant JSON by changing all "\" characters to "\\" before feeding it to your Perl script (be wary of SQL injection!)
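As a sketch of what the Perl side receives, and of the double-escaping in option 4 (a workaround, not a fix for the underlying encoding mismatch):

$json = json_encode(array('name' => 'résumé'));
echo $json;                                   // {"name":"r\u00e9sum\u00e9"}

$escaped = str_replace('\\', '\\\\', $json);  // option 4: protect the backslashes
echo $escaped;                                // {"name":"r\\u00e9sum\\u00e9"}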
How do I know whether a string is a multi-byte string, so that we use mb_strlen() instead of strlen()?
You need to always know what encoding a string is in, and whether it is a multibyte one. After all, you need to pass the string's encoding as the second parameter to mb_strlen() to get reliable results, right?
The encoding of incoming data will always be defined in some way - the page's encoding when processing form data; the database connection's and tables' encoding when processing database data; and so on. It is your job to build the flow in a way that you always know what is in what encoding where.
The only exception is when you're dealing with arbitrary third-party data that doesn't declare its content's encoding properly. It is then (and only then) that it's okay to employ sniffing functions like mb_detect_encoding() and friends. Remember that those functions are very error-prone and can only give you an educated guess about what encoding a string is in, not hard, reliable info.
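For instance (a small sketch; mb_strlen() counts correctly only when you tell it the right encoding):

$s = 'über';                      // UTF-8 source file assumed
echo mb_strlen($s, 'UTF-8');      // 4 characters
echo mb_strlen($s, 'ISO-8859-1'); // 5 - wrong encoding, so it effectively counts single bytes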
No. A string is a string. There is no way to tell whether it contains multi-byte characters.
You can guess with something like mb_detect_encoding(), but your mileage may vary depending on the charset and encoding. For example, UTF-8 has a very distinct pattern and you will get very good results, but other encodings like GB2312 are really hard to detect.
If you are designing a new protocol or system, it's best to keep the encoding information.
Compare the strlen() and mb_strlen() results; if they do not match, the string contains multi-byte characters.
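A sketch of that check (note it only works if you already know which encoding to pass to mb_strlen()):

if (strlen($s) !== mb_strlen($s, 'UTF-8')) {
    // at least one multi-byte sequence is present in $s
}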
Isn't mb_check_encoding() or mb_detect_encoding() supposed to be used for that?