Charset - json_encode with utf-8 - php

I know there are many questions to this problem and I've read most of them, of course including 'UTF-8 all the way through'.
Following those examples and hints, I reduced everything to this minimal example, which unfortunately still won't print a German umlaut ö after json_encoding an array:
(and here is the question - why? what else can I do?)
<?php
error_reporting(E_ALL);
header('Content-Type: text/html; charset=UTF-8');
?>
<!DOCTYPE html>
<html lang="de">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<?php
echo "<br>ini_get('default_charset') ". ini_get('default_charset')."<br>"; // nothing shown
// if (!ini_set('default_charset', 'utf-8')) { // won't work (I guess I'm not allowed to do that)
// echo "could not set default_charset to utf-8<br>";
// }
echo "Köln"; // yay! displays "Köln" as expected
$darr = Array();
$locationString = mb_convert_encoding("location", "UTF-8");
$darr[$locationString] = mb_convert_encoding("Köln", "UTF-8");
$json = json_encode($darr);
echo $json;
// output:
// {"plain":"K\u00f6ln","utf_encode":"K\u00c3\u00b6ln","utf_decode":"K"}
// dah? why?
$array = json_decode($json);
var_dump($array);
// ... even worse: "Köln"
phpinfo();
?>
</body>
</html>
relevant system info:
php 5.2.5 (yeah, I know. I can't change it)
from phpinfo():
default_charset no value
json
json support enabled
json version 1.2.1
mbstring
Multibyte Support enabled
Multibyte string engine libmbfl
mbstring.encoding_translation Off Off
Could this be my problem?
...and yes, the php file is encoded utf-8 (without BOM) in sublimeText. Submitted to server via FileZilla once as ASCII, once Binary, no change.

When encoding unicode data with json_encode you should use the JSON_UNESCAPED_UNICODE flag:
$json = json_encode($darr, JSON_UNESCAPED_UNICODE);
The above is available since php 5.4.0.
For older versions you can try this function instead (note that the anonymous function requires PHP 5.3+; on earlier versions pass the name of a regular function to array_walk_recursive):
function unicode_json_encode($arr) {
    // The convmap covers code points 0x80 to 0xffff, i.e. everything above ASCII 127.
    // Those characters are turned into numeric entities, which "hides" them from
    // json_encode's escaping; the entities are converted back to UTF-8 afterwards.
    array_walk_recursive($arr, function (&$item, $key) {
        if (is_string($item)) {
            $item = mb_encode_numericentity($item, array(0x80, 0xffff, 0, 0xffff), 'UTF-8');
        }
    });
    return mb_decode_numericentity(json_encode($arr), array(0x80, 0xffff, 0, 0xffff), 'UTF-8');
}
The above function was taken from the comments in json_encode page in php.net

You simply haven't told PHP not to escape the characters when it encodes the data as JSON.
From the manual:
JSON_UNESCAPED_UNICODE (integer)
Encode multibyte Unicode characters literally (default is to escape as \uXXXX). Available since PHP 5.4.0.
So:
$json = json_encode($darr, JSON_UNESCAPED_UNICODE);
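It may help to see that the escaped form is not actually broken: \u00f6 is valid JSON and decodes back to the same UTF-8 string. The following is a minimal sketch, assuming PHP 5.4+ (for JSON_UNESCAPED_UNICODE); the umlaut is written as explicit UTF-8 bytes so the result does not depend on how the file was saved.

```php
<?php
// "Köln" spelled out as UTF-8 bytes (ö = 0xC3 0xB6).
$darr = array('location' => "K\xC3\xB6ln");

$escaped = json_encode($darr);                         // default: non-ASCII escaped
$literal = json_encode($darr, JSON_UNESCAPED_UNICODE); // PHP 5.4+: kept as UTF-8

echo $escaped, "\n"; // {"location":"K\u00f6ln"}
echo $literal, "\n"; // {"location":"Köln"}

// Both forms decode to the same array.
$a = json_decode($escaped, true);
$b = json_decode($literal, true);
var_dump($a === $b); // bool(true)
```

Any JSON parser, including JSON.parse in the browser, treats the two outputs as equivalent, so the flag mostly matters for readability and size.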

Related

PHP: Convert Extended Ascii file to UTF-8

I can't find any way to get valid UTF-8 as output...
$fx = file_get_contents("Extended Ascii file.txt"); // example only has chr(129), but could be mixed Extended Ascii + UTF8
// not working:
//$fx = html_entity_decode($fx, ENT_QUOTES, "UTF-8");
//$fx = mb_convert_encoding($fx, 'UTF-8', 'ASCII');
//$fx = utf8_encode($fx);
//$fx = iconv('ASCII', 'UTF-8//IGNORE', $fx);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(129)"=>"�"
$fx = strtr($fx, [chr(128)=>'Ç',chr(129)=>'ü',chr(130)=>'é',chr(131)=>'â',chr(132)=>'ä',chr(133)=>'à',chr(134)=>'å',chr(135)=>'ç',chr(136)=>'ê',chr(137)=>'ë',chr(138)=>'è',chr(139)=>'ï',chr(140)=>'î',chr(141)=>'ì',chr(142)=>'Ä',chr(143)=>'Å',chr(144)=>'É',chr(145)=>'æ',chr(146)=>'Æ',chr(147)=>'ô',chr(148)=>'ö',chr(149)=>'ò',chr(150)=>'û',chr(151)=>'ù',chr(152)=>'ÿ',chr(153)=>'Ö',chr(154)=>'Ü',chr(155)=>'ø',chr(156)=>'£',chr(157)=>'Ø',chr(158)=>'×',chr(159)=>'ƒ',chr(160)=>'á',chr(161)=>'í',chr(162)=>'ó',chr(163)=>'ú',chr(164)=>'ñ',chr(165)=>'Ñ',chr(166)=>'ª',chr(167)=>'º',chr(168)=>'¿',chr(169)=>'®',chr(170)=>'¬',chr(171)=>'½',chr(172)=>'¼',chr(173)=>'¡',chr(174)=>'«',chr(175)=>'»',chr(176)=>'░',chr(177)=>'▒',chr(178)=>'▓',chr(179)=>'│',chr(180)=>'┤',chr(181)=>'Á',chr(182)=>'Â',chr(183)=>'À',chr(184)=>'©',chr(185)=>'╣',chr(186)=>'║',chr(187)=>'╗',chr(188)=>'╝',chr(189)=>'¢',chr(190)=>'¥',chr(191)=>'┐',chr(192)=>'└',chr(193)=>'┴',chr(194)=>'┬',chr(195)=>'├',chr(196)=>'─',chr(197)=>'┼',chr(198)=>'ã',chr(199)=>'Ã',chr(200)=>'╚',chr(201)=>'╔',chr(202)=>'╩',chr(203)=>'╦',chr(204)=>'╠',chr(205)=>'═',chr(206)=>'╬',chr(207)=>'¤',chr(208)=>'ð',chr(209)=>'Ð',chr(210)=>'Ê',chr(211)=>'Ë',chr(212)=>'È',chr(213)=>'ı',chr(214)=>'Í',chr(215)=>'Î',chr(216)=>'Ï',chr(217)=>'┘',chr(218)=>'┌',chr(219)=>'█',chr(220)=>'▄',chr(221)=>'¦',chr(222)=>'Ì',chr(223)=>'▀',chr(224)=>'Ó',chr(225)=>'ß',chr(226)=>'Ô',chr(227)=>'Ò',chr(228)=>'õ',chr(229)=>'Õ',chr(230)=>'µ',chr(231)=>'þ',chr(232)=>'Þ',chr(233)=>'Ú',chr(234)=>'Û',chr(235)=>'Ù',chr(236)=>'ý',chr(237)=>'Ý',chr(238)=>'¯',chr(239)=>'´',chr(240)=>'≡',chr(241)=>'±',chr(242)=>'‗',chr(243)=>'¾',chr(244)=>'¶',chr(245)=>'§',chr(246)=>'÷',chr(247)=>'¸',chr(248)=>'°',chr(249)=>'¨',chr(250)=>'·',chr(251)=>'¹',chr(252)=>'³',chr(253)=>'²',chr(254)=>'■',chr(255)=>'nbsp']);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(195)"=>"�"
How to convert or remove � ?
28.05.2020 Update: Solution found, thanks to Andrea Pollini!
Some notes:
iconv('UTF-8', 'UTF-8//IGNORE', $fx); // IGNORE is broken in PHP since - https://www.php.net/manual/en/function.iconv.php#108643 - use mb_convert_encoding
Here was my real problem (I figured it out later after many tests):
$P["T"] .= $text; // here was the problem, array is converting strings... (don't know why?)
changed to:
ini_set('mbstring.substitute_character', "none"); // mb_convert_encoding set remove unknown
$P["T"] .= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Now it's working. But if somebody knows why arrays are converting strings and how to disable that, that would be great. :)
First, configure mbstring to discard characters that cannot be converted:
<?php
ini_set('mbstring.substitute_character', "none");
?>
Next, you can use mb_convert_encoding:
mb_convert_encoding($fx, "UTF-8", mb_detect_encoding($fx, "UTF-8, ISO-8859-1, ISO-8859-15", true));
You can add any other encodings you need to the list passed to mb_detect_encoding.
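Two things in the question are worth illustrating. First, the strtr table matches DOS code page 850, so a direct conversion from that code page may be all that is needed (this is an assumption inferred from the table, not something the asker confirmed). Second, $fx[0] reads a single byte, so after a successful conversion it returns only the first byte of the two-byte ü (195), which the browser shows as �. A minimal sketch, with to_utf8 as a hypothetical helper name:

```php
<?php
// Sketch only: 'CP850' is an assumption based on the mapping table in the question.
function to_utf8($bytes, $legacy = 'CP850')
{
    // Already valid UTF-8 (this includes plain ASCII): leave it untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise treat the whole string as the legacy code page.
    return mb_convert_encoding($bytes, 'UTF-8', $legacy);
}

$fx = to_utf8("\x81ber");                 // chr(129) followed by "ber"
echo bin2hex($fx), "\n";                  // c3bc626572
echo strlen($fx), "\n";                   // 5 (bytes)
echo mb_strlen($fx, 'UTF-8'), "\n";       // 4 (characters)
echo ord($fx[0]), "\n";                   // 195, only the first byte of "ü"
echo mb_substr($fx, 0, 1, 'UTF-8'), "\n"; // ü
```

The check works on the whole string, so for a file that mixes both encodings it would have to be applied piece by piece, for example line by line.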

HTML Special Characters (foreign languages)

Basically I have this string:
Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文(繁體)
And I want to convert it into all possible HTML entities (there might be Russian characters too!).
I've tried calling "htmlspecialchars" and "htmlentities" with different charsets, but they return empty strings...
$l = htmlentities("Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文(繁體) €", ENT_COMPAT, "BIG5-HKSCS");
$l = htmlentities($l, ENT_COMPAT, "KOI8-R");
$l = htmlentities($l, ENT_COMPAT, "EUC-JP");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
echo $l;
returns an empty string.
Any help?
Here's my "unutf8" function, which converts all UTF-8 characters into HTML entities of the form &#12345;
function unutf8($str) {
    return preg_replace_callback("([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|[\xF8-\xFB][\x80-\xBF]{4}|[\xFC-\xFD][\x80-\xBF]{5})",
        function($m) {
            $c = $m[0];
            // Strip the leading 1 bits of the lead byte to get its payload bits.
            $out = bindec(ltrim(decbin(ord($c[0])),"1"));
            $l = strlen($c);
            for( $i=1; $i<$l; $i++) {
                // Each continuation byte contributes its low 6 bits.
                $out = ($out<<6) | bindec(ltrim(decbin(ord($c[$i])),"1"));
            }
            if( $out < 256) return chr($out);
            return "&#".$out.";";
        },$str);
}
It parses the string for valid UTF8 character sequences and converts the multi-byte sequence into the ordinal value of the character. It's very messy and I don't expect to win any awards for good coding with this, but it works.
Please note, however, that if you have unencoded characters then you WILL run into problems. For example, if for some reason you have é©© then the result will be &#39529; (驩). Please make sure your string is valid UTF-8 before passing it to the function.
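As an alternative to a hand-written decoder, mbstring can produce the same kind of decimal entities. This is a sketch assuming the input really is UTF-8; the convmap values are one common choice, not the only valid one. The empty result in the question is likely because htmlentities returns an empty string when the bytes are not valid in the charset passed to it (unless ENT_IGNORE or ENT_SUBSTITUTE is set).

```php
<?php
// Convert every character above ASCII into a decimal numeric entity.
// convmap = (first code point, last code point, offset, mask).
$convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);

$in  = "\xC4\x8C \xE2\x82\xAC";   // "Č €" as UTF-8 bytes
$out = mb_encode_numericentity($in, $convmap, 'UTF-8');

echo $out, "\n"; // &#268; &#8364;

// The entities decode back to the original UTF-8 string.
$back = html_entity_decode($out, ENT_QUOTES, 'UTF-8');
var_dump($back === $in); // bool(true)
```

Unlike unutf8, this leaves ASCII untouched and never emits raw single bytes for code points below 256.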
Use header() to set the charset in the HTTP Content-Type header to UTF-8:
header('Content-Type: text/html; charset=utf-8');
Also, make sure your HTML document declares UTF-8 as well:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Don't reach for complicated solutions; just follow one of these simple steps:
1) mysql_set_charset("utf8", $conn); set this right after your connection code.
or
2) mysql_query("SET NAMES 'UTF8'");
enter your query here........
mysql_set_charset("utf8", $conn); // the second argument is the connection, not a query result

Trouble with decode JSON + PHP

My php script gives out this string (for example) for JSON:
{"time":"0:38:01","kto":"\u00d3\u00e1\u00e8\u00e2\u00f6\u00e0 \u00c3\u00e5\u00ed\u00e5\u00f0\u00e0\u00eb\u00ee\u00e2","mess":"\u00c5\u00e4\u00e8\u00ed\u00fb\u00e9: *mm"}
jQuery code gets this string through JSON:
$.getJSON('chat_ajax.php?q=1',
function(result) {
alert('Time ' + result.time + ' Kto' + result.kto + ' Mess' + result.mess);
});
The browser shows:
0:38:01 Óáèâöà Ãåíåðàëîâ
Åäèíûé: *mm
How can I decode this string to Cyrillic?
I tried using:
<META http-equiv="content-type" content="text/html; charset=windows-1251">
but nothing changed.
PHP Code:
$res1 = mysqli_query($dbc, "SELECT * FROM chat ORDER BY id DESC LIMIT 1");
while ($row1 = mysqli_fetch_array($res1)) {
    $rawArray = array('time' => date("G:i:s", ($row1['time'] + $plus)), 'kto' => $row1['kto'], 'mess' => $row1['mess']);
    $encodedArray = array_map('utf8_encode', $rawArray);
    echo json_encode($encodedArray);
}
PHP ver 5.3.19
\uXXXX stands for Unicode characters, and in Unicode 00d3 is Ó, and so on. Unicode characters are unambiguous, so the character encoding of the page is ignored for them. You could use the correct Unicode escapes (i.e. \u0423 for У) or write your script so that it outputs the real characters in Windows-1251 instead of Unicode escape sequences.
Update
I see from your comment that you fetch this data from MySQL and use json_encode() to output it. json_encode only works for UTF-8 encoded data. Your data is Windows-1251, where the byte 0xD3 is У, but utf8_encode treats its input as ISO-8859-1, where 0xD3 is Ó; this is why you get the wrong Unicode sequences.
So, you will have to convert all data from Windows-1251 to UTF-8 before passing it to json_encode, then everything else will work fine.
Converting:
$utf8Array = array_map(function($in) {
return iconv('Windows-1251', 'UTF-8', $in);
}, $rawArray);
utf8_encode will not work because it is only useful for input in ISO-8859-1 encoding.
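The difference can be reproduced with the first six bytes of the kto value from the question. This is a small sketch that uses mb_convert_encoding for both conversions; converting from ISO-8859-1 stands in for utf8_encode, which does the same thing.

```php
<?php
// Six bytes as they come out of a Windows-1251 database column.
$raw = "\xD3\xE1\xE8\xE2\xF6\xE0";

// What utf8_encode effectively does: interpret the bytes as ISO-8859-1.
$wrong = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
// What the data needs: interpret the bytes as Windows-1251.
$right = mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');

echo json_encode($wrong), "\n"; // "\u00d3\u00e1\u00e8\u00e2\u00f6\u00e0"
echo json_encode($right), "\n"; // "\u0423\u0431\u0438\u0432\u0446\u0430"
```

The first line matches the escapes in the question's output; the second contains code points from the Cyrillic block, which the browser renders correctly regardless of the page charset.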
I had a similar problem when storing JSON data in a MySQL database; this solved it:
json_encode($json_data, JSON_UNESCAPED_UNICODE) ;

Php/json: decode utf8?

I store a json string that contains some (chinese ?) characters in a mysql database.
Example of what's in the database:
normal.text.\u8bf1\u60d1.rest.of.text
On my PHP page I just do a json_decode of what I receive from mysql, but it doesn't display right, it shows things like "½±è§�"
I've tried to execute the "SET NAMES 'utf8'" query at the beginning of my file, didn't change anything.
I already have the following header on my webpage:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
And of course all my php files are encoded in UTF-8.
Do you have any idea how to display these "\uXXXX" characters nicely?
This seems to work fine for me, with PHP 5.3.5 on Ubuntu 11.04:
<?php
header('Content-Type: text/plain; charset="UTF-8"');
$json = '[ "normal.text.\u8bf1\u60d1.rest.of.text" ]';
$decoded = json_decode($json, true);
var_dump($decoded);
Outputs this:
array(1) {
[0]=>
string(31) "normal.text.诱惑.rest.of.text"
}
Unicode is not UTF-8!
$ echo -en '\x8b\xf1\x60\xd1\x00\n' | iconv -f unicodebig -t utf-8
诱惑
This is a strange "encoding" you have. I guess each character of the normal text is one byte long (US-ASCII)? Then you have to extract the \u.... sequences, convert each sequence into a two-byte character, and convert that character to UTF-8 with iconv("unicodebig", "utf-8", $character) (see iconv in the PHP documentation). This worked on my side:
$in = "normal.text.\u8bf1\u60d1.rest.of.text";
function ewchar_to_utf8($matches) {
    $ewchar = $matches[1];
    $binwchar = hexdec($ewchar);
    // Build the two-byte big-endian representation of the code unit.
    $wchar = chr(($binwchar >> 8) & 0xFF) . chr(($binwchar) & 0xFF);
    return iconv("unicodebig", "utf-8", $wchar);
}
function special_unicode_to_utf8($str) {
    return preg_replace_callback("/\\\u([[:xdigit:]]{4})/i", "ewchar_to_utf8", $str);
}
echo special_unicode_to_utf8($in);
Otherwise we need more information on how the string in your database is encoded.
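To tell a decoding problem from a display problem, it can help to look at the bytes rather than at the rendered page. The sketch below assumes the stored value is the inside of a JSON string with no unescaped quotes or control characters, so wrapping it in double quotes makes it decodable by json_decode.

```php
<?php
// Single quotes keep the backslashes literal, as they would arrive from the database.
$stored  = 'normal.text.\u8bf1\u60d1.rest.of.text';
$decoded = json_decode('"' . $stored . '"');

// Bytes 12..17 hold the two decoded characters.
echo bin2hex(substr($decoded, 12, 6)), "\n"; // e8afb1e68391
var_dump(mb_check_encoding($decoded, 'UTF-8')); // bool(true)
```

If the bytes are e8afb1 and e68391, the string is correct UTF-8 and the garbled output comes from how the page is served or declared, not from json_decode.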
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
That's a red herring. If you serve your page over HTTP and the response contains a Content-Type header with a charset, the meta tag is ignored. PHP sends a Content-Type header by default if you don't set one explicitly, and depending on the PHP and server configuration its charset may be something other than UTF-8 (ISO-8859-1, for example).
Try with this line:
<?php
header("Content-Type: text/html; charset=UTF-8");

Character Encoding UTF8 Issue when using mb_detect_encoding() with PHP

I am reading an rss feed http://beersandbeans.com/feed/
The feed says it is in UTF-8 format, and I am using SimplePie to import the content. When I grab the content and store it in $content, I perform the following:
<?php
header ('Content-type: text/html; charset=utf-8');
?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head><body>
<?php
echo $content;
echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true);
echo $content = mb_convert_encoding($content, "UTF-8", $enc);
echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true);
?>
</body></html>
This then produces:
..... Camping: 2,000isk/day for 5 days) = $89 .....
ISO-8859-1
..... Camping: Â Â 2,000isk/day for 5 days) = $89 .....
UTF-8
Why is it outputting the Â?
Try not specifying "UTF-8,ISO-8859-1" and see what encoding it gives you. It might be detecting ISO-8859-1 because it's the last one in that list, rather than the actual encoding of the string.
Set strict-mode to true in mb_detect_encoding(), see http://www.php.net/manual/de/function.mb-detect-encoding.php#102510
Also try http://www.php.net/manual/de/function.mb-convert-encoding.php instead of iconv()
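A stray Â before a space is the typical signature of double encoding: a UTF-8 non-breaking space (bytes C2 A0) is read as ISO-8859-1 and converted to UTF-8 a second time. The sketch below reproduces that, assuming the feed contains non-breaking spaces; ensure_utf8 is a hypothetical helper that only converts when the input is not already valid UTF-8.

```php
<?php
$nbsp   = "\xC2\xA0"; // U+00A0 in UTF-8
$double = mb_convert_encoding($nbsp, 'UTF-8', 'ISO-8859-1');

echo bin2hex($nbsp), "\n";   // c2a0
echo bin2hex($double), "\n"; // c382c2a0, rendered as "Â" plus a non-breaking space

// Hypothetical guard: convert only if the bytes are not valid UTF-8 already.
function ensure_utf8($s, $fallback = 'ISO-8859-1')
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', $fallback);
}

$content = "Camping:" . $nbsp . $nbsp . "2,000isk/day";
var_dump(ensure_utf8($content) === $content); // bool(true)
```

With such a guard the second conversion never runs on content that is already UTF-8, whatever mb_detect_encoding reports.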
