JSON encode unicode issue on PHP5.3 - php

A Hebrew string looks like this after json_encode:
[{"id":"1","value":"\u05d1\u05dc\u05d0\u05d2\u05df"}
Any idea what encoding this is, and how do I get it to either work or be readable again?
BTW, this is a Joomla system running on PHP 5.3. The string comes from a POST request, not a database, and the UTF-8 meta tag does exist.

That's just how JSON encodes non-ASCII characters. The text will be readable again when you pass it through a JSON parser.
PHP 5.4 defines a new option for json_encode, JSON_UNESCAPED_UNICODE, that would pass UTF-8 text through as-is without converting it to escape codes. Since you are using PHP 5.3 you can't use it, but if you had 5.4 this is how it would be used:
$json = json_encode($obj, JSON_UNESCAPED_UNICODE); // PHP 5.4 required
However, this should not be needed because the JSON parser will decode the escape codes.
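As a quick check of that point, here is a minimal sketch that passes the escaped text through json_decode(). The sample is the value from the question with a closing bracket added so that it parses; the variable names are only illustrative.

```php
<?php
// Escaped JSON as produced by json_encode() (single quotes keep the backslashes literal).
$json = '[{"id":"1","value":"\u05d1\u05dc\u05d0\u05d2\u05df"}]';

// Decode into nested associative arrays.
$rows = json_decode($json, true);

// The \uXXXX escapes come back as ordinary UTF-8 text.
echo $rows[0]['value'], "\n";
```

The escapes only exist in the serialized form; once decoded, the value is the same UTF-8 string that went in.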

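// PHP 5.3 workaround: encode normally, then turn each \uXXXX escape back into UTF-8.
$encoded = json_encode($json); // $json holds the data to encode
$unescaped = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($matches) {
    return html_entity_decode('&#x' . $matches[1] . ';', ENT_COMPAT, 'UTF-8');
}, $encoded);
file_put_contents('sample.json', $unescaped);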

Related

Json encode with special char [duplicate]

I have some json I need to decode, alter and then encode without messing up any characters.
If I have a unicode character in a json string it will not decode. I'm not sure why, since json.org says a string can contain: any-Unicode-character-except-"-or-\-or-control-character. But it doesn't work in Python either.
{"Tag":"Odómetro"}
I can use utf8_encode which will allow the string to be decoded with json_decode, however the character gets mangled into something else. This is the result from a print_r of the result array. Two characters.
[Tag] => OdÃ³metro
When I encode the array again I get the character escaped to ASCII, which is correct according to the JSON spec:
"Tag"=>"Od\u00f3metro"
Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.
Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only in PHP 5.4. I will have to use some regex as I only have 5.3.
$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...
I have found following way to fix this issue... I hope this can help you.
json_encode($data,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.
Here's why I think so:
json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is exactly what you would see when UTF-8 text is parsed as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, and that byte sequence reads as Ã³ in ISO 8859-1).
Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).
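To make the byte-level reasoning in this answer concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and it writes the bytes with \x escapes so the result does not depend on how the script file itself is saved. The variable names are illustrative only.

```php
<?php
// "Odómetro" as ISO 8859-1 bytes: ó is the single byte 0xF3.
$latin1 = "Od\xF3metro";

// json_encode() expects UTF-8, so the lone 0xF3 byte makes encoding fail
// (it returns false on current PHP versions).
$failed = json_encode($latin1);

// Convert ISO 8859-1 to UTF-8 first: ó becomes the two bytes 0xC3 0xB3.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$json = json_encode($utf8); // "Od\u00f3metro"

// Converting a second time treats 0xC3 and 0xB3 as two separate ISO 8859-1
// characters, which is where the mangled two-character output comes from.
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo $json, "\n", $mangled, "\n";
```

The same double conversion happens implicitly whenever correct UTF-8 output is displayed by something that assumes ISO 8859-1.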
JSON_UNESCAPED_UNICODE was added in PHP 5.4, so it looks like you need to upgrade your version of PHP to take advantage of it. 5.4 is not released yet though! :(
There is a 5.4 alpha release candidate on QA though if you want to play on your development machine.
A hacky way of doing JSON_UNESCAPED_UNICODE in PHP 5.3. Really disappointed by PHP json support. Maybe this will help someone else.
$array = some_json();
// Encode all string children in the array to html entities.
array_walk_recursive($array, function(&$item, $key) {
if(is_string($item)) {
$item = htmlentities($item);
}
});
$json = json_encode($array);
// Decode the html entities and end up with unicode again.
$json = html_entity_decode($json);
$json = array('tag' => 'Odómetro'); // Original array
$json = json_encode($json); // {"tag":"Od\u00f3metro"}
$json = json_decode($json); // Od\u00f3metro becomes Odómetro
echo $json->{'tag'}; // Odómetro
echo utf8_decode($json->{'tag'}); // Odómetro
You were close, just use utf8_decode.
try setting the utf-8 encoding in your page:
header('content-type:text/html;charset=utf-8');
this works for me:
$arr = array('tag' => 'Odómetro');
$encoded = json_encode($arr);
$decoded = json_decode($encoded);
echo $decoded->{'tag'};
Try Using:
utf8_decode() and utf8_encode
To encode an array that contains special characters, ISO 8859-1 to UTF8. (If utf8_encode & utf8_decode is not what is working for you, this might be an option)
Everything that is in ISO-8859-1 should be converted to UTF8:
$utf8 = utf8_encode('이 감사의 마음을 전합니다!'); //contains UTF8 & ISO 8859-1 characters;
$iso88591 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$data = $iso88591;
Encode should work after this:
$encoded_data = json_encode($data);
Convert UTF-8 to & from ISO 8859-1

Php json_encode converts utf8 string to characters codes

I have a Persian text "سرما"
When I convert it to JSON using json_encode(), I get a series of escaped character codes such as \u0633, which seems expected and reasonable. My confusion is that I don't know how to convert them back into a readable string of characters. How should I do that in PHP?
Should I use anything from the mb_* family? I have also checked the json_encode() parameters and found nothing appropriate.
UPDATE
what I get saved in my DB is:
["u0633u0631u0645u0627"]
Which shows the characters are not escaped properly. While if I change it to
["\u0633\u0631\u0645\u0627"] it becomes easily readable by json_decode()
They should be converted back on the other end when it's decoded. This is the safest option, as it might not be possible to guarantee that the transmission or storage will not corrupt a multi-byte encoding.
If you're certain that everything is safe for UTF8 end-to-end you can do:
$res = json_encode($foo, \JSON_UNESCAPED_UNICODE);
http://php.net/manual/en/function.json-encode.php
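A small sketch of the trade-off described above, assuming PHP 5.4 or later (for the flag) and a script file saved as UTF-8. Both encodings of the same data decode to an identical value, so the choice only affects how the serialized text looks.

```php
<?php
$foo = array('سرما');

// Default: non-ASCII characters become \uXXXX escapes (pure ASCII output).
$escaped = json_encode($foo);                         // ["\u0633\u0631\u0645\u0627"]

// With the flag: the UTF-8 bytes are written as-is.
$raw = json_encode($foo, JSON_UNESCAPED_UNICODE);     // ["سرما"]

// Both forms decode to the same PHP value.
var_dump(json_decode($escaped) === json_decode($raw)); // bool(true)
var_dump(json_decode($escaped) === $foo);              // bool(true)
```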
Maybe try encoding the unicode characters, and then json_encoding it, then on the other side (receiving JSON) decode the json, then decode the unicode.
Example:
//Encode
json_encode(utf8_encode($string));
//Decode
utf8_decode(json_decode($string));
It's simple: just use the JSON_UNESCAPED_SLASHES flag.
Your problem isn't UTF-8; you need to force JSON not to escape slashes.
Example:
$bar = "سرما";
$res = json_encode($bar, JSON_UNESCAPED_SLASHES );
// $res is "\u0633\u0631\u0645\u0627"
If the backslashes are missing when you check the result in your MySQL database,
it happened because you didn't use addslashes().
Example:
$bar = "سرما";
$res = json_encode($bar, JSON_UNESCAPED_SLASHES );
$res = addslashes($res);
// $res is \"\\u0633\\u0631\\u0645\\u0627\" and is now ready to use in MySQL
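The broken value shown in the question's update is what remains when one level of backslashes is removed somewhere between json_encode() and the database. Where that happens in the asker's setup is an assumption on my part (a stray stripslashes() or similar). This sketch only reproduces the symptom and assumes a script file saved as UTF-8.

```php
<?php
$json = json_encode(array('سرما'));   // ["\u0633\u0631\u0645\u0627"]

// Removing the backslashes reproduces the value the asker found in the DB.
$stripped = stripslashes($json);       // ["u0633u0631u0645u0627"]

// The intact JSON decodes to the original text.
$good = json_decode($json);

// The stripped JSON is still valid, but it now holds the plain ASCII text
// "u0633u0631u0645u0627" instead of the Persian word.
$bad = json_decode($stripped);

var_dump($good, $bad);
```

Passing the JSON to the database through a parameterized query (PDO or mysqli prepared statements) keeps the backslashes intact without calling addslashes() by hand.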

Encoding string with non-ascii characters

I have a string such as this - Panamá. I need to convert this string to Panam\xE1 so it's readable in a JavaScript file I'm generating using PHP.
Is there a function to encode this in PHP? Any ideas would be appreciated.
My rule is,
If you try to encode or escape data using preg_replace or
using massive mapping arrays or str_replace, STOP you are probably doing it wrong.
All it takes is one missed or erroneous mapping (and you WILL miss some mappings) and you end up with code that doesn't work in all cases and that corrupts your data in some cases. Whole libraries have already been written that are dedicated to doing the translations for you (e.g. iconv), and for escaping data you should use the proper PHP function.
If you plan on outputting the data to a browser (the fact you want to encode for javascript suggests this) then I suggest using UTF8 encoding. If your data is in latin-1, use the utf8_encode function.
Whether your PHP string contains ASCII characters or not, to send any data from PHP to JS you should ALWAYS use the json_encode function.
PHP code
$your_encoding = 'latin1';
$panama = "Panamá";
//Get your data in utf8 if it isnt already
$panama = iconv($your_encoding, "utf-8", $panama);
$panama_encoded = json_encode($panama);
echo "var js_panama = " . $panama_encoded . ";";
JS Output
var js_panama = "Panam\u00e1";
Even though JSON supports unicode, it may not be compatible with your non UTF-8 javascript file. This is not a problem because the json_encode PHP function will escape unicode characters by default.
Assuming that your input is in the latin-1 encoding then ord and dechex will do what you want:
$result = preg_replace_callback(
'/[\x80-\xff]/',
function($match) {
return '\x'.dechex(ord($match[0]));
},
$input);
If your input is in any other encoding then you would need to know what encoding that is and adapt the solution accordingly. Note that in this case it would not be possible to use specifically the \x## notation in the JS output in all cases.
This should work for you:
$str = "Panamá";
$str = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$utf = iconv('UTF-8', 'UCS-4', current($m));
return sprintf("\x%s", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $str);
echo $str;
Output (Source Code):
Panam\xE1
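One caveat with the snippet above: for code points beyond U+00FF it would emit more than two hex digits after \x (for example \x15F), and JavaScript reads \x with exactly two digits. The sketch below is one possible variant, not the answerer's code: it uses \xNN only up to U+00FF and falls back to \uNNNN (with surrogate pairs) otherwise. The helper name js_escape_non_ascii is made up for illustration, and the code assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: escape every non-ASCII character of a UTF-8 string
// so the result can be placed inside a JavaScript string literal.
function js_escape_non_ascii($utf8)
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        // UTF-16BE gives one 16-bit unit for BMP characters, two for the rest.
        $hex = strtoupper(bin2hex(mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8')));

        if (strlen($hex) === 4 && substr($hex, 0, 2) === '00') {
            // U+0080..U+00FF fits the two-digit \xNN form.
            return '\x' . substr($hex, 2);
        }
        // Everything else: one \uNNNN per UTF-16 unit.
        return '\u' . implode('\u', str_split($hex, 4));
    }, $utf8);
}

echo js_escape_non_ascii('Panamá'), "\n"; // Panam\xE1
```

Quotes, backslashes and control characters are not handled here, so json_encode() remains the safer choice when the \x notation is not a hard requirement.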

How to produce JSON - un-escaped unicodes in php 5.3.x [duplicate]

When I use json_encode to encode my multilingual strings, it also changes special characters. What should I do to keep them the same?
For example
<?
echo json_encode(array('şüğçö'));
It returns something like ["\u015f\u00fc\u011f\u00e7\u00f6"]
But I want ["şüğçö"]
try it:
<?
echo json_encode(array('şüğçö'), JSON_UNESCAPED_UNICODE);
In JSON any character in strings may be represented by a Unicode escape sequence. Thus "\u015f\u00fc\u011f\u00e7\u00f6" is semantically equal to "şüğçö".
Although those characters can also be used plain, json_encode probably prefers the Unicode escape sequences to avoid character encoding issues.
PHP 5.4 adds the option JSON_UNESCAPED_UNICODE, which does what you want. Note that json_encode always outputs UTF-8.
You shouldn't want this
It's definitely possible, even without PHP 5.4.
First, use json_encode() to encode the string and save it in a variable.
Then simply use preg_replace() to replace all \uxxxx with unicode again.
In versions prior to 5.4, json_encode() does not provide any option for leaving these characters unescaped in the output.
<?php
print_r(json_decode(json_encode(array('şüğçö'))));
/*
Array
(
[0] => şüğçö
)
*/
So do you really need to keep these characters unescaped in the JSON?
Json_encode charset solution for PHP 5.3.3
Since JSON_UNESCAPED_UNICODE is not available in PHP 5.3.3, we used this method, and it works.
$data = array(
'text' => 'Päiväkampanjat'
);
$json_encode = json_encode($data);
var_dump($json_encode); // text: "P\u00e4iv\u00e4kampanjat"
$unescaped_data = preg_replace_callback('/\\\\u(\w{4})/', function ($matches) {
return html_entity_decode('&#x' . $matches[1] . ';', ENT_COMPAT, 'UTF-8');
}, $json_encode);
var_dump($unescaped_data); // text is unescaped -> Päiväkampanjat
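The regex approach above has a few edge cases: characters outside the Basic Multilingual Plane are encoded as two \uXXXX escapes (a surrogate pair), a literal backslash followed by "u" in the data must not be touched, and control characters need to stay escaped to keep the JSON valid. The following is a sketch of one way to cover those cases, not a drop-in replacement for JSON_UNESCAPED_UNICODE. The function name json_unescape_unicode is hypothetical, and the code assumes the mbstring extension.

```php
<?php
// Hypothetical helper, not part of PHP: post-processes json_encode() output
// and turns \uXXXX escapes for non-ASCII characters back into raw UTF-8.
function json_unescape_unicode($json)
{
    $pattern = '/\\\\\\\\'                                                      // escaped backslash
        . '|\\\\u([dD][89abAB][0-9a-fA-F]{2})\\\\u([dD][c-fC-F][0-9a-fA-F]{2})' // surrogate pair
        . '|\\\\u([0-9a-fA-F]{4})/';                                            // single escape

    return preg_replace_callback($pattern, function ($m) {
        if (isset($m[3])) {
            // Single escape: keep ASCII/control characters and lone surrogates escaped.
            $code = hexdec($m[3]);
            if ($code < 0x80 || ($code >= 0xD800 && $code <= 0xDFFF)) {
                return $m[0];
            }
            $hex = $m[3];
        } elseif (isset($m[2])) {
            // Surrogate pair: both halves form one UTF-16 sequence.
            $hex = $m[1] . $m[2];
        } else {
            // Escaped backslash: leave it so a literal "\u00e4" in the data survives.
            return $m[0];
        }
        return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
    }, $json);
}

$data = array('text' => 'Päiväkampanjat');
echo json_unescape_unicode(json_encode($data)), "\n"; // {"text":"Päiväkampanjat"}
```

The result still decodes with json_decode() to the same data, since only the representation of non-ASCII characters changes.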
