Php/json: decode utf8?

I store a JSON string that contains some (Chinese?) characters in a MySQL database.
Example of what's in the database:
normal.text.\u8bf1\u60d1.rest.of.text
On my PHP page I just call json_decode on what I receive from MySQL, but it doesn't display correctly; it shows things like "½±è§�".
I've tried executing the "SET NAMES 'utf8'" query at the beginning of my file, but it didn't change anything.
I already have the following header on my webpage:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
And of course all my php files are encoded in UTF-8.
Do you have any idea how to display these "\uXXXX" characters nicely?

This seems to work fine for me, with PHP 5.3.5 on Ubuntu 11.04:
<?php
header('Content-Type: text/plain; charset="UTF-8"');
$json = '[ "normal.text.\u8bf1\u60d1.rest.of.text" ]';
$decoded = json_decode($json, true);
var_dump($decoded);
Outputs this:
array(1) {
[0]=>
string(31) "normal.text.诱惑.rest.of.text"
}

Unicode is not UTF-8!
$ echo -en '\x8b\xf1\x60\xd1\x00\n' | iconv -f unicodebig -t utf-8
诱惑
This is a strange "encoding" you have. I guess each character of the normal text is one byte long (US-ASCII)? Then you have to extract the \u.... sequences, convert each sequence into a two-byte character, and convert that character to UTF-8 with iconv("unicodebig", "utf-8", $character) (see iconv in the PHP documentation). This worked on my side:
$in = "normal.text.\u8bf1\u60d1.rest.of.text";
function ewchar_to_utf8($matches) {
$ewchar = $matches[1];
$binwchar = hexdec($ewchar);
$wchar = chr(($binwchar >> 8) & 0xFF) . chr(($binwchar) & 0xFF);
return iconv("unicodebig", "utf-8", $wchar);
}
function special_unicode_to_utf8($str) {
return preg_replace_callback("/\\\u([[:xdigit:]]{4})/i", "ewchar_to_utf8", $str);
}
echo special_unicode_to_utf8($in);
Otherwise we need more information on how the string in your database is encoded.
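If the stored value really is plain text with literal \uXXXX escapes (an assumption, since the question doesn't show the full column content), a shorter sketch is to let json_decode do the conversion by wrapping the value in double quotes so that it becomes a JSON string literal. The helper name decode_u_escapes is made up for illustration, and any other backslash sequences in the text would be interpreted as JSON escapes too.

```php
<?php
// Hypothetical helper: decode literal \uXXXX escapes by treating the text as
// the body of a JSON string. Assumes the text contains no raw double quotes
// or control characters, which would make the wrapped JSON invalid.
function decode_u_escapes($text) {
    $decoded = json_decode('"' . $text . '"');
    // json_decode returns null on invalid JSON; fall back to the input.
    return $decoded === null ? $text : $decoded;
}

// Single quotes keep the backslashes literal in PHP source.
$in = 'normal.text.\u8bf1\u60d1.rest.of.text';
echo decode_u_escapes($in), "\n";
```

The result is a UTF-8 string, so the page still has to be served with a UTF-8 charset for it to display correctly.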

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
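That's a red herring. If you serve your page over HTTP and the response contains a Content-Type header that declares a charset, that charset takes precedence and the meta tag is ignored. PHP sends a Content-Type header by default if you don't set one explicitly, and depending on the PHP version and server configuration its charset may be missing or something other than UTF-8 (ISO-8859-1 is a common fallback).
Try adding this line before any output: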
<?php
header("Content-Type: text/html; charset=UTF-8");

Related

Charset - json_encode with utf-8

I know there are many questions about this problem and I've read most of them, including of course 'UTF-8 all the way through'.
Following those examples and hints I reduced everything to this minimal example, which unfortunately still won't print the German umlaut ö after json_encoding an array.
So here is the question: why, and what else can I do?
<?php
error_reporting(E_ALL);
header('Content-Type: text/html; charset=UTF-8');
?>
<!DOCTYPE html>
<html lang="de">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<?php
echo "<br>ini_get('default_charset') ". ini_get('default_charset')."<br>"; // nothing shown
// if (!ini_set('default_charset', 'utf-8')) { // won't work (I guess I'm not allowed to do that)
// echo "could not set default_charset to utf-8<br>";
// }
echo "Köln"; // yay! displays "Köln" as expected
$darr = Array();
$locationString = mb_convert_encoding("location", "UTF-8");
$darr[$locationString] = mb_convert_encoding("Köln", "UTF-8");
$json = json_encode($darr);
echo $json;
// output:
// {"plain":"K\u00f6ln","utf_encode":"K\u00c3\u00b6ln","utf_decode":"K"}
// dah? why?
$array = json_decode($json);
var_dump($array);
// ... even worse: "KÃ¶ln"
phpinfo();
?>
</body>
</html>
relevant system info:
php 5.2.5 (yeah, I know. I can't change it)
from phpinfo():
default_charset no value
json
json support enabled
json version 1.2.1
mbstring
Multibyte Support enabled
Multibyte string engine libmbfl
mbstring.encoding_translation Off Off
Could this be my problem?
...and yes, the php file is encoded utf-8 (without BOM) in sublimeText. Submitted to server via FileZilla once as ASCII, once Binary, no change.

When encoding unicode data with json_encode you should use the JSON_UNESCAPED_UNICODE flag:
$json = json_encode($darr, JSON_UNESCAPED_UNICODE);
The flag is available since PHP 5.4.0.
For older versions you can try this function instead (note that the closure syntax it uses requires PHP 5.3 or later):
function unicode_json_encode($arr) {
// The convmap starts at 0x80, so it covers every character above ASCII 127. Those characters are turned into numeric entities, which "hides" them from json_encode's \uXXXX escaping; they are converted back afterwards.
array_walk_recursive($arr, function (&$item, $key) { if (is_string($item)) $item = mb_encode_numericentity($item, array (0x80, 0xffff, 0, 0xffff), 'UTF-8'); });
return mb_decode_numericentity(json_encode($arr), array (0x80, 0xffff, 0, 0xffff), 'UTF-8');
}
The above function was taken from the comments on the json_encode page on php.net.

You simply haven't told PHP not to escape the characters when it encodes the data as JSON.
From the manual:
JSON_UNESCAPED_UNICODE (integer)
Encode multibyte Unicode characters literally (default is to escape as \uXXXX). Available since PHP 5.4.0.
So:
$json = json_encode($darr, JSON_UNESCAPED_UNICODE);
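It may help to see that the escaped form is not corruption: \u00f6 is simply JSON's ASCII-safe spelling of ö, and both forms decode to the same string. A minimal sketch, reusing the $darr name from the question with a sample value written as explicit UTF-8 bytes:

```php
<?php
$darr = array('location' => "K\xC3\xB6ln"); // "Köln" as UTF-8 bytes

$escaped = json_encode($darr);                         // default: \uXXXX escapes
$literal = json_encode($darr, JSON_UNESCAPED_UNICODE); // PHP >= 5.4: literal UTF-8

echo $escaped, "\n"; // {"location":"K\u00f6ln"}
echo $literal, "\n"; // {"location":"Köln"}

// Both forms decode to the same value.
$a = json_decode($escaped, true);
$b = json_decode($literal, true);
var_dump($a === $b); // bool(true)
```

Any conforming JSON parser, including JSON.parse in the browser, turns the escape back into the real character, so the flag only matters when the raw JSON text itself is shown to a person.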

HTML Special Characters (foreign languages)

Basically I have this string:
Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文（繁體)
And I want to convert it into HTML entities wherever possible (there might be Russian characters too!).
I've tried various "htmlspecialchars" and "htmlentities" calls with different charsets, but they return empty strings...
$l = htmlentities("Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文（繁體） €", ENT_COMPAT, "BIG5-HKSCS");
$l = htmlentities($l, ENT_COMPAT, "KOI8-R");
$l = htmlentities($l, ENT_COMPAT, "EUC-JP");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
echo $l;
returns an empty string.
Any help?

Here's my "unutf8" function, which converts all UTF-8 characters into HTML entities of the form &#12345;
function unutf8($str) {
return preg_replace_callback("([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|[\xF8-\xFB][\x80-\xBF]{4}|[\xFC-\xFD][\x80-\xBF]{5})",
function($m) {
$c = $m[0];
$out = bindec(ltrim(decbin(ord($c[0])),"1"));
$l = strlen($c);
for( $i=1; $i<$l; $i++) {
$out = ($out<<6) | bindec(ltrim(decbin(ord($c[$i])),"1"));
}
if( $out < 256) return chr($out);
return "&#".$out.";";
},$str);
}
It parses the string for valid UTF-8 character sequences and converts each multi-byte sequence into the ordinal value (code point) of the character. It's very messy and I don't expect to win any awards for good coding with this, but it works.
Please note, however, that if the string contains characters that are not UTF-8 encoded then you WILL run into problems. For example, if for some reason you have é©© as raw single-byte characters (bytes E9 A9 A9), they are read as one three-byte UTF-8 sequence and the result will be &#39529; (驩). Please make sure your string is valid UTF-8 before passing it to the function.
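As an alternative sketch, the mbstring extension can do the same conversion without a hand-written regex. This assumes mbstring is installed and the input is valid UTF-8; the wrapper name to_numeric_entities is hypothetical. Unlike unutf8 above, it emits an entity for every code point from 0x80 upwards, including those below 256.

```php
<?php
// Hypothetical wrapper around mb_encode_numericentity().
function to_numeric_entities($str) {
    // convmap is (start, end, offset, mask): every code point from 0x80 up to
    // 0x10FFFF becomes a decimal entity; ASCII is left untouched.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($str, $convmap, 'UTF-8');
}

echo to_numeric_entities("Español, 日本語"), "\n";
```

The output is pure ASCII, so it displays the same way regardless of which charset the page is served with.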

Use header() to declare UTF-8 in the HTTP Content-Type header:
header('Content-Type: text/html; charset=utf-8');
Also, make sure your HTML document declares UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Don't go for complicated solutions; just follow these simple steps:
1) mysql_set_charset("utf8", $conn); set this in your connection setup code.
or
2) mysql_query("SET NAMES 'UTF8'");
enter your query here........
mysql_set_charset("UTF8", queryResult);

UTF8 TEXT coming back with weird symbols

I'm storing text in a DB as UTF-8.
When a post is sent via JS to my API, symbols such as ö come back as "Ã¶".
My website's HTML is declared as
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My API output is sent out with a header declaring utf-8, like so:
$status_header = 'HTTP/1.1 '.$status.' '.self::getStatusCodeMessage($status);
header($status_header);
header('Content-type: ' . $content_type.'; charset=utf-8');
if ($body !== '') {
echo $body;
The only way I've managed to get round this is by using PHP on my output to do this:
private static function fixText($text) {
$replaceChars = array(
"â€œ" => "\"",
'â€¢' => '·',
"â€" => "\"",
"â€™" => "'",
'Ã¶' => 'ö',
'â€' => "'",
"Ã©" => "é",
"Ã«" => "ë",
"Â£" => "£"
);
foreach($replaceChars as $oldChar => $newChar) {
$text = str_replace($oldChar, $newChar, $text);
}
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);
return $text;
}
Obviously this is not ideal as I have to keep adding more and more symbols to the map.
UPDATE:
A developer had sneakily added this code:
$document->text = mb_convert_encoding($document->text, mb_detect_encoding($document->text), "cp1252");
As a way to overcome old latin characters coming through damaged.

Seeing those funny characters means that you have double-encoded UTF-8 stored. You don't show how you are adding data to the database, but if you use utf8_encode() on strings that are already UTF-8 encoded, this will be the result.
MongoDB only accepts UTF-8, but you should not encode the data again yourself if you're already getting UTF-8 sent through to you by the web server.
Instead of:
header('Content-type: ' . $content_type.'; charset=utf-8');
Consider setting the default charset in php.ini:
default_charset=UTF-8
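To make the double-encoding concrete, here is a small sketch. It uses mb_convert_encoding from ISO-8859-1 as a stand-in for utf8_encode() (an assumption made so the example does not depend on a function that newer PHP versions deprecate) and shows that one level of double-encoding can be reversed by the opposite conversion.

```php
<?php
$utf8 = "\xC3\xB6"; // "ö" correctly encoded as UTF-8

// Treating UTF-8 bytes as ISO-8859-1 and converting to UTF-8 again
// double-encodes them, which is the effect of utf8_encode() on UTF-8 input.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c2b6, displayed as "Ã¶" on a UTF-8 page

// Reversing exactly one level of double-encoding.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8); // bool(true)
```

Repairing stored data this way is only a stopgap; the lasting fix is to remove the extra encoding step from the write path.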

Trouble with decode JSON + PHP

My php script gives out this string (for example) for JSON:
{"time":"0:38:01","kto":"\u00d3\u00e1\u00e8\u00e2\u00f6\u00e0 \u00c3\u00e5\u00ed\u00e5\u00f0\u00e0\u00eb\u00ee\u00e2","mess":"\u00c5\u00e4\u00e8\u00ed\u00fb\u00e9: *mm"}
jQuery code gets this string through JSON:
$.getJSON('chat_ajax.php?q=1',
function(result) {
alert('Time ' + result.time + ' Kto' + result.kto + ' Mess' + result.mess);
});
The browser shows:
0:38:01 Óáèâöà Ãåíåðàëîâ
Åäèíûé: *mm
How can I decode this string to Cyrillic?
I tried using:
<META http-equiv="content-type" content="text/html; charset=windows-1251">
but nothing changed.
PHP Code:
$res1=mysqli_query($dbc, "SELECT * FROM chat ORDER BY id DESC LIMIT 1");
while ($row1=mysqli_fetch_array($res1)) {
$rawArray=array('time' => #date("G:i:s", ($row1['time'] + $plus)), 'kto' => $row1[kto], 'mess' => $row1[mess]);
$encodedArray = array_map(utf8_encode, $rawArray);
echo json_encode($encodedArray);
PHP ver 5.3.19

\uXXXX stands for a Unicode code point, and in Unicode 00d3 is Ó, and so on. Unicode escapes are unambiguous, so the character encoding of the page is ignored for them. You could output the correct escapes (i.e. \u0423 for У) or write your script so that it outputs the real characters in Windows-1251 instead of Unicode escape sequences.
Update
I see from your comment that you fetch this data from MySQL and use json_encode() to output it. json_encode only works for UTF-8 encoded data. Your data is in Windows-1251, but utf8_encode treats it as ISO-8859-1, where byte d3 is Ó; this is why you get the wrong Unicode sequences.
So, you will have to convert all data from Windows-1251 to UTF-8 before passing it to json_encode, then everything else will work fine.
Converting:
$utf8Array = array_map(function($in) {
return iconv('Windows-1251', 'UTF-8', $in);
}, $rawArray);
utf8_encode will not work because it is only useful for input in ISO-8859-1 encoding.
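Putting the conversion together with json_encode, a minimal sketch looks like this. The sample bytes are the Windows-1251 bytes behind the first word of the "kto" value in the question; the array is reduced to two keys for brevity, and the iconv extension is assumed to be available.

```php
<?php
// Sample row as it might come from a Windows-1251 column (assumed data).
$rawArray = array('time' => '0:38:01', 'kto' => "\xD3\xE1\xE8\xE2\xF6\xE0");

// Convert every value from Windows-1251 to UTF-8 before encoding.
$utf8Array = array_map(function ($in) {
    return iconv('Windows-1251', 'UTF-8', $in);
}, $rawArray);

echo json_encode($utf8Array), "\n";
// {"time":"0:38:01","kto":"\u0423\u0431\u0438\u0432\u0446\u0430"}
```

The escapes now start at \u0423 (У) instead of \u00d3 (Ó), so the browser renders Cyrillic regardless of the page charset.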

I had a similar problem when storing JSON data in a MySQL database; this solved it:
json_encode($json_data, JSON_UNESCAPED_UNICODE) ;

Character Encoding UTF8 Issue when using mb_detect_encoding() with PHP

I am reading an rss feed http://beersandbeans.com/feed/
The feed says it is in UTF-8 format, and I am using SimplePie to import the content. When I grab the content and store it in $content, I do the following:
<?php
header ('Content-type: text/html; charset=utf-8');
?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head><body>
<?php
echo $content;
echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true);
echo $content = mb_convert_encoding($content, "UTF-8", $enc);
echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true);
?>
</body></html>
This then produces:
..... Camping: 2,000isk/day for 5 days) = $89 .....
ISO-8859-1
..... Camping: Â Â 2,000isk/day for 5 days) = $89 .....
UTF-8
Why is it outputting the Â ?

Try not specifying "UTF-8,ISO-8859-1" and see what encoding it gives you. It might be detecting ISO-8859-1 because it's the last one in that list, rather than the actual encoding of the string.

Set strict-mode to true in mb_detect_encoding(), see http://www.php.net/manual/de/function.mb-detect-encoding.php#102510
Also try http://www.php.net/manual/de/function.mb-convert-encoding.php instead of iconv()
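One way to avoid converting content that is already UTF-8 is to validate it first instead of relying on detection. The following is a sketch only: the helper name ensure_utf8 and the ISO-8859-1 fallback are assumptions, and the sample bytes are a non-breaking space in each encoding.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not already
// valid UTF-8. The fallback encoding is a guess about the source data.
function ensure_utf8($content, $fallback = 'ISO-8859-1') {
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid; converting again would double-encode
    }
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

$nbspUtf8   = "\xC2\xA0"; // non-breaking space encoded as UTF-8
$nbspLatin1 = "\xA0";     // the same character encoded as ISO-8859-1

var_dump(ensure_utf8($nbspUtf8) === $nbspUtf8);   // bool(true): left untouched
var_dump(ensure_utf8($nbspLatin1) === $nbspUtf8); // bool(true): converted once
```

Converting the UTF-8 non-breaking space a second time as if it were ISO-8859-1 is what produces a visible Â in front of it.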
