php utf8_encode - chars get prepended with \u - php

I've got an ISO string that I fetch from the database, and when I utf8_encode it, I get \u00f6 instead of ö. This confuses the JavaScript/HTML that calls this PHP script via AJAX. Why is there a \u00f6 instead of ö? How do I get ö instead?
edit:
Ok, I did some more experimenting and it turns out this is caused by the combination of utf8_encode and json_encode. However, if I don't utf8_encode at all, the value will be null in the JSON.
json_encode(array("city"=>utf8_encode("göteborg")))

utf8_encode doesn't encode characters to \uxxxx, as you figured out yourself it's json_encode doing this. And that's fine, because the JSON format specifies this behavior. If your client properly decodes the JSON string into a Javascript data type, the \uxxxx escapes will be turned into proper Unicode characters.
As for json_encode discarding characters if your string is Latin1 encoded: json_encode requires all string data to be UTF-8 encoded. JavaScript and JSON are entirely Unicode based, so Latin1 is an invalid and unexpected encoding to use with JSON strings, and it breaks.
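To make the two points above concrete, here is a minimal sketch, assuming PHP 7.2 or later with the mbstring extension. The variable names are illustrative only. It builds the Latin-1 bytes by hand, shows json_encode rejecting them, then converts to UTF-8 before encoding.

```php
<?php
// "göteborg" as ISO-8859-1 bytes: ö is the single byte 0xF6.
$latin1 = "g\xF6teborg";

// json_encode requires UTF-8 input, so the Latin-1 string is rejected.
$bad = json_encode(["city" => $latin1]);
var_dump($bad);                    // bool(false)
echo json_last_error_msg(), "\n";

// Convert once, at the boundary, then encode.
$utf8 = mb_convert_encoding($latin1, "UTF-8", "ISO-8859-1");
$json = json_encode(["city" => $utf8]);
echo $json, "\n";                  // {"city":"g\u00f6teborg"}

// Any JSON parser turns the escape back into the real character.
$back = json_decode($json, true);
var_dump($back["city"] === $utf8); // bool(true)
```

The \u00f6 in the output is only the wire representation; the decoded value is the same string that went in.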

How do you print that? JavaScript natively supports \uXXXX escapes, and doing this in JavaScript:
var x = "\u00f6"; alert(x);
should print out a small ö.
EDIT: According to your code, if you output that directly to the response stream and use the actual response as a variable in js on the client side, you shouldn't care about json_encode at all.
You would just tell the browser that the content is UTF-8 by setting the Content-Type header:
header('Content-Type: text/plain; charset=utf-8');
And then the jQuery.data() code would work just fine.

Related

Why is php converting certain characters to '?'

Everything in my code is running. My database (PostgreSQL) uses UTF-8 encoding, and I've checked that the encoding in php.ini is UTF-8. I tried debugging to see whether any of the functions I use were causing this, but everything runs as expected. However, after my frontend sends a POST request through curl to the backend server for some text to be inserted in the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think PHP is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
this is the code to send the request
$pre_opp->Send_Request_To_BackEnd("/Settings", $school_name, $uuid, "Upload_Bio", "POST", str_replace(" ", "%", utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data = utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request-sending code; you shouldn't need to decode it at the other end, because PHP will do that for you.
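A rough sketch of that advice, assuming PHP 7 or later with mbstring and a source file saved as UTF-8. The bio text is made up, and because Send_Request_To_BackEnd isn't shown, http_build_query and parse_str stand in here for the request body and for PHP's automatic population of $_POST.

```php
<?php
// Hypothetical bio text, already UTF-8.
$bio = "50% da Ação, Svrček";

// Sender: percent-encode the value for a form body. No utf8_encode needed.
$body = http_build_query(["Data" => $bio]);

// Receiver: parse_str decodes the body the way PHP fills $_POST.
parse_str($body, $post);
var_dump($post["Data"] === $bio);   // bool(true)

// Converting UTF-8 to ISO-8859-1 loses characters that Latin-1 lacks.
$lossy = mb_convert_encoding("č", "ISO-8859-1", "UTF-8");
echo $lossy, "\n";                  // ?
```

The literal "%" and the space both survive because the encoder escapes them, which is what the hand-rolled str_replace could not guarantee.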

PHP and Unicode or UTF-8?

My PHP application outputs JSON where special characters are encoded, e.g. the string "Brøndum" is represented as "Br\u00f8ndum".
Can you tell me which encoding this is, as well as how I get back from "Br\u00f8ndum" to "Brøndum"?
I have tried utf8_encode/decode but they don't work as expected.
Thanks!
That's standard JSON Unicode escaping.
You get back to the actual character by using a JSON parser. json_decode in the case of PHP.
You can tell PHP not to escape Unicode characters in the first place with the JSON_UNESCAPED_UNICODE flag.
json_encode("Brøndum", JSON_UNESCAPED_UNICODE)
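As a small sketch of both options, assuming PHP 5.4 or later and a source file saved as UTF-8, the following encodes the same string with and without the flag and checks that either form decodes back to the original:

```php
<?php
$name = "Brøndum"; // UTF-8 source file assumed

$escaped   = json_encode($name);                         // "Br\u00f8ndum"
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE); // "Brøndum"

echo $escaped, "\n", $unescaped, "\n";

// Both forms decode to the same string.
var_dump(json_decode($escaped) === $name);   // bool(true)
var_dump(json_decode($unescaped) === $name); // bool(true)
```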
mb_detect_encoding is your function. You just pass it the string and it tries to detect the encoding. You can also pass it a list of candidate encodings (a plain string like "hello" could be valid in several different encodings).
echo mb_detect_encoding("Br\u00f8ndum");

decoding ISO characters

I have Chinese characters stored as numeric entities in ISO-8859-1, for example 兼 = &#20860;
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the Handlebars template engine to put the data on the page.
When I look at the AJAX page, the characters are displayed correctly, although the source is still encoded.
But the final result displays the encoded characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1, but I can change them in UTF8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
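A short sketch of the difference between the two decode functions, assuming the stored value is the numeric entity &#20860; (code point U+517C, the character 兼). The variable names are illustrative.

```php
<?php
// Value as stored: a numeric HTML entity, pure ASCII.
$stored = "&#20860;";

// htmlspecialchars_decode only handles the few HTML-special entities.
var_dump(htmlspecialchars_decode($stored) === $stored); // bool(true), unchanged

// html_entity_decode resolves numeric entities into the target charset.
$char = html_entity_decode($stored, ENT_COMPAT, "UTF-8");
echo bin2hex($char), "\n"; // e585bc, the UTF-8 bytes of U+517C
```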
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}

Problem json_encode utf-8 [duplicate]

I have a problem with json_encode function with special characters.
For example I try this:
$string="Svrček";
echo "ENCODING=".mb_detect_encoding($string); //ENCODING=UTF-8
echo "JSON=".json_encode($string); //JSON="Svr\u010dek"
What can I do to display the string correctly, so JSON="Svrček"?
Thank you very much.
json_encode() is not actually outputting JSON* there. It’s outputting a javascript string. (It outputs JSON when you give it an object or an array to encode.) That’s fine, as a javascript string is what you want.
In javascript (and in JSON), č may be escaped as \u010d. The two are equivalent. So there’s nothing wrong with what json_encode() is doing. It should work fine. I’d be very surprised if this is actually causing you any form of problem. However, if the transfer is safely in a Unicode encoding (UTF-8, usually)†, there’s no need for it either. If you want to turn off the escaping, you can do so thus: json_encode('Svrček', JSON_UNESCAPED_UNICODE). Note that the flag JSON_UNESCAPED_UNICODE was introduced in PHP 5.4.0, and is unavailable in earlier versions.
By the way, contrary to what @onteria_ says, JSON does use UTF-8:
The character encoding of JSON text is always Unicode. UTF-8 is the only encoding that makes sense on the wire, but UTF-16 and UTF-32 are also permitted.
* Or, at least, it's not outputting JSON as defined in RFC 4627. However, there are other definitions of JSON, by which scalar values are allowed.
† JSON may be in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE.
Ok, so, after you make the database connection in your PHP script, put in this line and it should work; at least it solved my problem:
mysql_query('SET CHARACTER SET utf8');
Yes, json_encode escapes non-ascii characters. If you decode it you'll get your original result:
$string="こんにちは";
echo "ENCODING: " . mb_detect_encoding($string) . "\n";
$encoded = json_encode($string);
echo "ENCODED JSON: $encoded\n";
$decoded = json_decode($encoded);
echo "DECODED JSON: $decoded\n";
Output:
ENCODING: UTF-8
ENCODED JSON: "\u3053\u3093\u306b\u3061\u306f"
DECODED JSON: こんにちは
EDIT: It's worth noting that:
JSON uses Unicode exclusively.
The self-documenting format that describes structure and field names as well as specific values;
Source: http://www.json.org/fatfree.html
It uses Unicode NOT UTF-8. This FAQ Explains the difference between UTF-8 and Unicode:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
When you use JSON, your non-ASCII characters get escaped as Unicode code points. For example, こ is code point U+3053.

replace eurosign in json

Can anyone help me with this one
I have this query, and only after adding the last one, which is indexed against the euro, do I get invalid JSON.
$url = 'http://www.google.com/finance/info?client=ig&q=goog,yhoo,AMS:TOM2';
$response = file_get_contents($url); // assumed fetch step; the original snippet omits it
$response = json_decode($response, true);
The only difference I see if I echo the output directly is the question mark in the JSON.
What would I use to replace the euro sign in the returned JSON? Hopefully that will solve it.
thanks in adv, Richard
The JSON is valid ISO-8859-1, or Latin1. If your application is using some other encoding, say UTF-8, you need to convert the encoding of the response from Latin1 to UTF-8.
json_encode and json_decode expect their input and output to be UTF-8. Older PHP versions default to ISO-8859-1 as the charset in places, so you may have to convert. (Note that the euro sign doesn't exist in ISO-8859-1; it does exist in ISO-8859-15 and Windows-1252.)
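A minimal sketch of that conversion, assuming PHP 7 or later with mbstring. The response body below is made up, and since the euro sign is not part of ISO-8859-1, the sketch assumes the bytes are really Windows-1252, where the euro sign is the single byte 0x80. Check the actual response headers before relying on that.

```php
<?php
// Hypothetical response body in Windows-1252 (euro sign = byte 0x80).
$raw = "{\"cur\":\"\x80\"}";

// A lone 0x80 byte is not valid UTF-8, so decoding fails.
var_dump(json_decode($raw, true));   // NULL
echo json_last_error_msg(), "\n";

// Convert to UTF-8 first; the source encoding here is an assumption.
$utf8 = mb_convert_encoding($raw, "UTF-8", "Windows-1252");
$data = json_decode($utf8, true);
echo $data["cur"], "\n";             // €
```

Converting the whole response once, before json_decode, avoids having to search and replace individual characters such as the euro sign.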
