I have a problem with json_encode function with special characters.
For example I try this:
$string="Svrček";
echo "ENCODING=".mb_detect_encoding($string); //ENCODING=UTF-8
echo "JSON=".json_encode($string); //JSON="Svr\u010dek"
What can I do to display the string correctly, so JSON="Svrček"?
Thank you very much.
json_encode() is not actually outputting JSON* there. It’s outputting a JavaScript string. (It outputs JSON when you give it an object or an array to encode.) That’s fine, as a JavaScript string is what you want.
In JavaScript (and in JSON), č may be escaped as \u010d. The two are equivalent. So there’s nothing wrong with what json_encode() is doing. It should work fine. I’d be very surprised if this is actually causing you any form of problem. However, if the transfer is safely in a Unicode encoding (UTF-8, usually)†, there’s no need for it either. If you want to turn off the escaping, you can do so thus: json_encode('Svrček', JSON_UNESCAPED_UNICODE). Note that the flag JSON_UNESCAPED_UNICODE was introduced in PHP 5.4.0, and is unavailable in earlier versions.
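To see that the escaped and unescaped forms really are equivalent, here is a minimal sketch. It assumes PHP 5.4 or later, and it spells č as explicit UTF-8 bytes only so that the example does not depend on how the source file is saved:

```php
<?php
// "Svrček" with č (U+010D) written as its UTF-8 bytes 0xC4 0x8D.
$name = "Svr\xC4\x8Dek";

$escaped   = json_encode($name);                         // ASCII-only output
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE); // raw UTF-8 output

echo $escaped, "\n";   // "Svr\u010dek"
echo $unescaped, "\n"; // "Svrček" (on a UTF-8 terminal)

// Both forms decode back to the same string.
var_dump(json_decode($escaped) === json_decode($unescaped)); // bool(true)
```

Any conforming JSON parser on the receiving side should treat the two outputs identically, so the flag mostly matters for readability and size.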
By the way, contrary to what @onteria_ says, JSON does use UTF-8:
The character encoding of JSON text is always Unicode. UTF-8 is the only encoding that makes sense on the wire, but UTF-16 and UTF-32 are also permitted.
* Or, at least, it's not outputting JSON as defined in RFC 4627. However, there are other definitions of JSON, by which scalar values are allowed.
† JSON may be in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE.
Ok, so, after you make database connection in your php script, put this line, and it should work, at least it solved my problem:
mysql_query('SET CHARACTER SET utf8');
Yes, json_encode escapes non-ascii characters. If you decode it you'll get your original result:
$string="こんにちは";
echo "ENCODING: " . mb_detect_encoding($string) . "\n";
$encoded = json_encode($string);
echo "ENCODED JSON: $encoded\n";
$decoded = json_decode($encoded);
echo "DECODED JSON: $decoded\n";
Output:
ENCODING: UTF-8
ENCODED JSON: "\u3053\u3093\u306b\u3061\u306f"
DECODED JSON: こんにちは
EDIT: It's worth noting that:
JSON uses Unicode exclusively.
The self-documenting format that
describes structure and field names as
well as specific values;
Source: http://www.json.org/fatfree.html
It uses Unicode NOT UTF-8. This FAQ Explains the difference between UTF-8 and Unicode:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
You use JSON, your non-ascii characters get escaped into Unicode code points. For example こ = code point 3053.
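The difference between a code point and its UTF-8 bytes can be checked directly. This is only a sketch; it assumes the mbstring extension and PHP 7.2 or later (for mb_ord):

```php
<?php
// こ (HIRAGANA LETTER KO) written as its three UTF-8 bytes.
$ko = "\xE3\x81\x93";

$codePoint = dechex(mb_ord($ko, 'UTF-8')); // the Unicode code point, in hex
$utf8Bytes = bin2hex($ko);                 // how UTF-8 stores that code point
$jsonForm  = json_encode($ko);             // the JSON escape, built from the code point

echo "U+$codePoint\n"; // U+3053
echo "$utf8Bytes\n";   // e38193
echo "$jsonForm\n";    // "\u3053"
```

The escape carries the code point (3053), not the UTF-8 byte sequence (e3 81 93).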
I have some json I need to decode, alter and then encode without messing up any characters.
If I have a unicode character in a json string it will not decode. I'm not sure why since json.org says a string can contain: any-Unicode-character- except-"-or-\-or- control-character. But it doesn't work in python either.
{"Tag":"Odómetro"}
I can use utf8_encode which will allow the string to be decoded with json_decode, however the character gets mangled into something else. This is the result from a print_r of the result array. Two characters.
[Tag] => OdÃ³metro
When I encode the array again I get the character escaped to ASCII, which is correct according to the JSON spec:
"Tag"=>"Od\u00f3metro"
Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.
Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only in PHP 5.4. I will have to use some regex, as I only have 5.3.
$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...
I have found following way to fix this issue... I hope this can help you.
json_encode($data,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.
Here's why I think so:
json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is actually exactly what would happen by trying to parse UTF-8 text as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, but that sequence is Ã³ in ISO 8859-1).
Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).
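The double-encoding effect described above can be reproduced in a few lines. This sketch assumes the mbstring extension and uses mb_convert_encoding in place of utf8_encode/utf8_decode, which perform the same ISO 8859-1 conversions:

```php
<?php
// "Odómetro" as correct UTF-8: ó is the two bytes 0xC3 0xB3.
$utf8 = "Od\xC3\xB3metro";

// Treat those bytes as ISO 8859-1 and convert to UTF-8 (what utf8_encode does).
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // 4f64c383c2b36d6574726f

// Reversing the conversion recovers the original bytes (what utf8_decode does).
$restored = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

The bytes c3 83 c2 b3 are the UTF-8 encodings of Ã and ³, which is why one character turns into two.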
JSON_UNESCAPED_UNICODE was added in PHP 5.4, so it looks like you need to upgrade your version of PHP to take advantage of it. 5.4 is not released yet though! :(
There is a 5.4 alpha release candidate on QA though if you want to play on your development machine.
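If the same code has to run on both old and new versions, one possible approach is to check for the constant before using it. This is a sketch, not a drop-in fix: on versions without the flag the output simply stays escaped.

```php
<?php
// Use the flag only when this PHP build defines it (5.4.0 and later).
$flags = defined('JSON_UNESCAPED_UNICODE') ? JSON_UNESCAPED_UNICODE : 0;

// ó written as explicit UTF-8 bytes so the file's own encoding does not matter.
$json = json_encode(array('Tag' => "Od\xC3\xB3metro"), $flags);
echo $json, "\n"; // unescaped on 5.4+, \u00f3 escape on older versions
```

Either output decodes to the same array, so consumers that parse the JSON properly are unaffected.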
A hacky way of doing JSON_UNESCAPED_UNICODE in PHP 5.3. Really disappointed by PHP json support. Maybe this will help someone else.
$array = some_json();
// Encode all string children in the array to html entities.
array_walk_recursive($array, function(&$item, $key) {
if(is_string($item)) {
$item = htmlentities($item);
}
});
$json = json_encode($array);
// Decode the html entities and end up with unicode again.
$json = html_entity_decode($json);
$json = array('tag' => 'Odómetro'); // Original array
$json = json_encode($json); // {"tag":"Od\u00f3metro"}
$json = json_decode($json); // Od\u00f3metro becomes Odómetro
echo $json->{'tag'}; // Odómetro
echo utf8_decode($json->{'tag'}); // Odómetro
You were close, just use utf8_decode.
try setting the utf-8 encoding in your page:
header('content-type:text/html;charset=utf-8');
this works for me:
$arr = array('tag' => 'Odómetro');
$encoded = json_encode($arr);
$decoded = json_decode($encoded);
echo $decoded->{'tag'};
Try Using:
utf8_decode() and utf8_encode
To encode an array that contains special characters, ISO 8859-1 to UTF8. (If utf8_encode & utf8_decode is not what is working for you, this might be an option)
Everything that is in ISO-8859-1 should be converted to UTF8:
$utf8 = utf8_encode('이 감사의 마음을 전합니다!'); //contains UTF8 & ISO 8859-1 characters;
$iso88591 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$data = $iso88591;
Encode should work after this:
$encoded_data = json_encode($data);
Convert UTF-8 to & from ISO 8859-1
My PHP application outputs JSON where special characters are encoded, e.g. the string "Brøndum" is represented as "Br\u00f8ndum".
Can you tell me which encoding this is, as well as how I get back from "Br\u00f8ndum" to "Brøndum".
I have tried utf8_encode/decode but they don't work as expected.
Thanks!
That's standard JSON unicode escaping.
You get back to the actual character by using a JSON parser. json_decode in the case of PHP.
You can tell PHP not to escape Unicode characters in the first place with the JSON_UNESCAPED_UNICODE flag.
json_encode("Brøndum", JSON_UNESCAPED_UNICODE)
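Going the other way, a minimal sketch of decoding the escaped form (the single quotes keep the backslash sequence literal, as it would arrive over the wire):

```php
<?php
// JSON text containing the escape, exactly as the application emits it.
$json = '"Br\u00f8ndum"';

$decoded = json_decode($json);
echo $decoded, "\n";          // Brøndum (as UTF-8)
echo bin2hex($decoded), "\n"; // 4272c3b86e64756d
```

The hex dump shows ø stored as the UTF-8 bytes c3 b8, so no utf8_encode or utf8_decode call is needed afterwards.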
mb_detect_encoding is your function. You just pass it the string and it detects the encoding. You can also pass it an array of candidate encodings (as a regular string like "hello" could potentially be valid in several different encodings).
echo mb_detect_encoding("Br\u00f8ndum");
I have a problem encoding this character with json_encode:
http://www.fileformat.info/info/unicode/char/92/index.htm
At first it gave me this error:
JSON_ERROR_UTF8, which is
'Malformed UTF-8 characters, possibly incorrectly encoded'
So I tried utf8_encode() before json_encode.
Now it returns this result: '\u0092'
Then I found this function:
function jsonRemoveUnicodeSequences($struct) {
return preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", json_encode($struct));
}
The character shows up, but together with another one:
Â’
I also tried htmlentities followed by html_entity_decode,
with no result.
json_encode() requires input that is
null
integer, float, boolean
string encoded as UTF-8
objects implementing JsonSerializable (or whatever it's called, I'm too lazy to look it up)
arrays of JSON-encodable objects
stdClass instances of JSON-encodable objects
So, if you have a string, you must first transcode it to UTF-8. The correct tool for that is the iconv library, but you need to know which encoding the string currently has in order to correctly transcode it.
Your approach to recursively transcode arrays or objects should work, but I'd strongly suggest not using anything but UTF-8 internally. If you have an interface where you have to accept different encodings, validate and reject immediately and use UTF-8 onwards. Similarly, when replying, keep UTF-8 until the last possible point where you can still signal encoding problems.
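As a rough illustration of why the input has to be UTF-8 first: the sketch below assumes PHP 5.5 or later (where json_encode returns false on failure) and assumes, purely for the example, that the incoming bytes are known to be Windows-1252.

```php
<?php
// A lone 0x92 byte is not valid UTF-8, so json_encode() refuses the string.
$raw = "it\x92s";
$result = json_encode($raw);
var_dump($result);                               // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)

// Assumption for this sketch: the source encoding is Windows-1252.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);
echo json_encode($utf8), "\n"; // "it\u2019s"
```

Checking the return value and json_last_error() at the boundary is one way to reject bad input immediately instead of carrying it further into the application.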
If you look at the link you included to the character U+0092, it is a control character, and it is also known as PRIVATE USE TWO. Its existence in your string means that your string is almost certainly not a UTF-8 string. Instead, it is probably a Windows-specific encoding, likely Windows-1252 if your text is English, in which 0x92 is a "smart quote" apostrophe, also known as a right single quotation mark. The Unicode equivalent of this character is U+2019.
Thus your data source is not giving you UTF-8 text. Either you can fix the source data to be UTF-8 encoded, or you can convert the text you receive. For example, the output of
echo iconv('Windows-1252','UTF-8', "\x92")
is
’
which is probably what you want. However, you want to make sure that all of your input is the same encoding. If some of your data is UTF-8 and some is Windows-1252, the above iconv call will properly handle Windows-1252 encoded apostrophes, but it will convert UTF-8 encoded apostrophes to
â€™
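One possible guard against that double conversion is to convert only strings that are not already valid UTF-8. The helper name to_utf8 is hypothetical, the mbstring extension is assumed, and the check is a heuristic: a few Windows-1252 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: convert only strings that are not already valid UTF-8.
function to_utf8($s)
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already UTF-8, leave untouched
    }
    // Assumption: anything else is Windows-1252.
    return iconv('Windows-1252', 'UTF-8', $s);
}

echo bin2hex(to_utf8("\x92")), "\n";         // e28099 (converted)
echo bin2hex(to_utf8("\xE2\x80\x99")), "\n"; // e28099 (left alone)
```

Fixing the data source so that everything arrives in one encoding remains the more reliable option.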
I've got an ISO string that I fetch from the database, and when I utf8_encode it, I get a \u00f6 instead of ö. This confuses the JavaScript/HTML which ajaxes this PHP script. Why is there a \u00f6 instead of ö? How do I get ö instead?
edit:
Ok, I did some more experimenting and it turns out this is caused by combination of utf8_encode and json_encode. Though if I don't utf8_encode at all, the value will be null in the json.
json_encode(array("city"=>utf8_encode("göteborg")))
utf8_encode doesn't encode characters to \uxxxx, as you figured out yourself it's json_encode doing this. And that's fine, because the JSON format specifies this behavior. If your client properly decodes the JSON string into a Javascript data type, the \uxxxx escapes will be turned into proper Unicode characters.
As for json_encode discarding characters if your string is Latin1 encoded: It's not explicitly stated on the manual page, but Javascript and JSON are entirely Unicode based, so I suspect Latin1 is an invalid and unexpected encoding to use with JSON strings, so it breaks.
How do you print that? JavaScript natively supports \uXXXX escapes, and doing this in JavaScript:
var x = "\u00f6"; alert(x);
should print out a small ö.
EDIT: According to your code, if you output that directly to the response stream and use the actual response as a variable in js on the client side, you shouldn't care about json_encode at all.
You would just tell the browser that the content is utf8 by setting the content-type header:
header('content-type: text/plain;charset=utf-8');
And then the jQuery.data() code would work just fine.
I have a JSON array which holds the correct string regardless of language, but when the JSON is encoded and written into the file it does not have the correct values. It has other values instead, random English letters, e.g. (uuadb). I want to write a string into a file where the string could be in any language. For now I am testing with Tamil. But I found PHP doesn't support Unicode. Please help me write Unicode characters into the file using PHP.
I tried using the pack function, but how do I use pack for any language? Or is there any other way of doing this? Please help me.
My guess is that you're seeing \uXXXX escapes instead of the non-ASCII characters you asked for. json_encode appears to always escape Unicode characters:
<?php
$arr = array("♫");
$json = json_encode($arr);
echo "$json\n";
# Prints ["\u266b"]
$str = '["♫"]';
$array = json_decode($str);
echo "{$array[0]}\n";
# Prints ♫
?>
If this is what you're getting, it's not wrong. You just have to ensure it's being decoded properly on the receiving end.
Another possibility is that the string you're passing is not in UTF-8. According to the documentation for json_encode and json_decode, these functions only work with UTF-8 data. Call mb_detect_encoding on your input string, and make sure it outputs either UTF-8 or ASCII.
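If the goal is a file that shows the characters themselves rather than \uXXXX escapes, a sketch along these lines may help. It assumes PHP 5.4 or later, assumes the script file is saved as UTF-8, and uses a temporary file path purely for illustration:

```php
<?php
// Tamil sample text; this relies on the source file being saved as UTF-8.
$data = array('greeting' => 'வணக்கம்');

// JSON_UNESCAPED_UNICODE keeps the characters readable in the file.
$json = json_encode($data, JSON_UNESCAPED_UNICODE);

$path = tempnam(sys_get_temp_dir(), 'json'); // illustrative temp file location
file_put_contents($path, $json);

// Reading the file back yields the original array.
$roundTrip = json_decode(file_get_contents($path), true);
var_dump($roundTrip === $data); // bool(true)
unlink($path);
```

Without the flag the file would contain escapes instead, which is still valid JSON and decodes to the same text.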