json encode utf8 error - php

I have a problem encoding this character with json_encode:
http://www.fileformat.info/info/unicode/char/92/index.htm
At first it gave me this error:
JSON_ERROR_UTF8, which is
'Malformed UTF-8 characters, possibly incorrectly encoded'
So I tried calling utf8_encode() before json_encode,
and now it returns '\u0092'.
Then I found this function:
function jsonRemoveUnicodeSequences($struct) {
return preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", json_encode($struct));
}
The character shows up, but together with another one:
Â’
I also tried htmlentities followed by html_entity_decode,
with no result.

json_encode() requires input that is one of:
null
integer, float, boolean
string encoded as UTF-8
objects implementing JsonSerializable
arrays of JSON-encodable values
stdClass instances whose properties are JSON-encodable values
So, if you have a string, you must first transcode it to UTF-8. The correct tool for that is the iconv library, but you need to know which encoding the string currently has in order to correctly transcode it.
Your approach to recursively transcode arrays or objects should work, but I'd strongly suggest not using anything but UTF-8 internally. If you have an interface where you have to accept different encodings, validate and reject immediately and use UTF-8 onwards. Similarly, when replying, keep UTF-8 until the last possible point where you can still signal encoding problems.
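A minimal sketch of the recursive transcoding step, assuming every string in the structure is Windows-1252 (substitute the real source encoding). The helper name toUtf8Recursive is hypothetical, and objects are deliberately left out to keep it short.

```php
<?php
// Hypothetical helper: walks nested arrays and transcodes every string
// (values and string keys) from $from to UTF-8. Other scalars pass through.
// Assumption: all strings really are in $from (Windows-1252 by default here).
function toUtf8Recursive($value, string $from = 'Windows-1252')
{
    if (is_string($value)) {
        return iconv($from, 'UTF-8', $value);
    }
    if (is_array($value)) {
        $out = [];
        foreach ($value as $key => $item) {
            $newKey = is_string($key) ? iconv($from, 'UTF-8', $key) : $key;
            $out[$newKey] = toUtf8Recursive($item, $from);
        }
        return $out;
    }
    return $value; // int, float, bool, null
}

// Sample data with Windows-1252 bytes: 0x92 (apostrophe) and 0xE9 (e acute).
$data = ['title' => "It\x92s here", 'tags' => ["caf\xE9", 42]];
echo json_encode(toUtf8Recursive($data)), "\n";
```

Note that iconv() returns false when it cannot convert a string, so production code would want to check for that and reject the input rather than pass it on.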

If you look at the link you included to the character U+0092, it is a control character, and it is also known as PRIVATE USE TWO. Its existence in your string means that your string is almost certainly not a UTF-8 string. Instead, it is probably a Windows-specific encoding, likely Windows-1252 if your text is English, in which 0x92 is a "smart quote" apostrophe, also known as a right single quotation mark. The Unicode equivalent of this character is U+2019.
Thus your data source is not giving you UTF-8 text. Either you can fix the source data to be UTF-8 encoded, or you can convert the text you receive. For example, the output of
echo iconv('Windows-1252','UTF-8', "\x92");
is
’
which is probably what you want. However, you want to make sure that all of your input is the same encoding. If some of your data is UTF-8 and some is Windows-1252, the above iconv call will properly handle Windows-1252 encoded apostrophes, but it will convert UTF-8 encoded apostrophes to
â€™
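If the input cannot be made uniform at the source, one possible guard is to convert only strings that are not already valid UTF-8. This is a sketch with a hypothetical helper name (ensureUtf8) and the assumption that anything invalid is Windows-1252. It is a heuristic: a few Windows-1252 byte sequences happen to be valid UTF-8 as well.

```php
<?php
// Hypothetical helper: leave valid UTF-8 untouched, transcode everything
// else from Windows-1252 (assumed fallback encoding).
function ensureUtf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

echo bin2hex(ensureUtf8("\x92")), "\n";         // Windows-1252 apostrophe -> e28099
echo bin2hex(ensureUtf8("\xE2\x80\x99")), "\n"; // already UTF-8, unchanged -> e28099
```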

Related

Why is php converting certain characters to '?'

Everything in my stack is set up for UTF-8: my database (PostgreSQL) uses UTF-8 encoding, and I've checked that the encoding in php.ini is UTF-8. I tried debugging to see whether any of the functions I use were responsible, but everything runs as expected. However, after my frontend sends a POST request to the backend server through curl, with some text to be inserted into the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think php is converting them to Latin-1 again after the request reaches the other side for some reason because I use utf8_encode before the request and utf8_decode on the other side.
This is the code that sends the request:
$pre_opp->Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
This is how the backend system receives it:
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
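A small illustration of that effect, using only the literal "%da" from the question: once it is treated as a percent escape it becomes the single byte 0xDA, which on its own is not valid UTF-8.

```php
<?php
// "%da" is read as a percent escape for the single byte 0xDA.
$decoded = urldecode('%da');

echo bin2hex($decoded), "\n";                    // da
var_dump(mb_check_encoding($decoded, 'UTF-8')); // bool(false): a lone lead byte
```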
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.
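As a rough sketch of what "encode once when building the request, and let PHP decode" can look like: the sample text and the 'Data' field name are taken from the question, while the use of http_build_query() and parse_str() is an assumption for illustration (parse_str() stands in for what PHP does when it fills $_POST from a form-encoded body).

```php
<?php
// Sample UTF-8 text containing a space, a non-ASCII character and a literal "%".
$bio = "Ol\xC3\xA1 da 100%";

// Sender side: http_build_query() percent-encodes the value once.
$body = http_build_query(['Data' => $bio]);
echo $body, "\n";

// Receiver side: PHP decodes form-encoded bodies into $_POST by itself;
// parse_str() mimics that decoding here so the example is self-contained.
parse_str($body, $post);
var_dump($post['Data'] === $bio); // bool(true): bytes arrive unchanged
```

No utf8_encode, utf8_decode or manual urldecode call is involved, and the UTF-8 bytes survive the round trip.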

PHP - string encoding

I am receiving as a $_GET parameter a string with "6d617263f2" as hex representation.
As far as I understand character encoding, this is not a UTF-8 string. If I print it with UTF-8 encoding, what I get is "marc�". If I convert the string to UTF-8 with utf8_encode, I get the correct representation, which is marcò.
I set all my character encodings (default_charset, iconv and mbstring) in the php.ini file to work with UTF-8. I also have mbstring.encoding_translation set to On.
I'm not able to fully understand what is going on. Why am I not getting my $_GET parameter correctly encoded as UTF-8?
My guesses are:
the client is using another character encoding, and if I want to use UTF-8 there is no other way than to explicitly convert my parameter to UTF-8
I am missing something somewhere...
Could you please help me to shed some light on this?
If you don't control the origin of that GET parameter, then there's nothing you can do. PHP will give you the string as is and won't automatically convert its encoding. It can't, since it doesn't know what encoding to convert from. There's no spec or anything where anyone could get that information from. You need to specify what encoding you accept strings in. Don't leave it up to the client to decide, because then you have no idea what you're going to get.
If the client sends you ISO-8859-1 encoded text, but you want it to be UTF-8 encoded internally (a sensible choice BTW), you will simply have to convert its encoding. I'd use iconv('ISO-8859-1', 'UTF-8', $_GET['foo']) for that since it's more explicit about what it does, but utf8_encode happens to do exactly the same thing.
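Applied to the bytes from the question, the conversion could look like this. It is a sketch that assumes the client really sent ISO-8859-1; hex2bin() is only used to reproduce the parameter value.

```php
<?php
// The parameter value from the question, rebuilt from its hex representation.
$raw = hex2bin('6d617263f2');

// 0xF2 followed by nothing is not valid UTF-8, so this reports false.
var_dump(mb_check_encoding($raw, 'UTF-8'));

// Assumption: the client encoded the text as ISO-8859-1.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
echo bin2hex($utf8), "\n"; // 6d617263c3b2, i.e. "marc" followed by U+00F2
```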

Why does json_decode return null after utf8_decode

I have retrieved a string from the database that contains Unicode characters.
After that I ran utf8_decode on it so I could read the characters clearly.
Then I passed the string to json_decode, but it returns null!
Without utf8_decode, json_decode returns an array with é characters.
utf8_decode converts a string's encoding from UTF-8 to ISO-8859-1, a.k.a. Latin-1.
json_decode expects, requires and returns UTF-8 encoded strings.
That's why it's obviously not working.
The string you get from the database is apparently UTF-8 encoded, which is good. You must not convert it to Latin-1 before you decode the JSON. You should also not convert it afterwards, just keep everything in UTF-8. The only problem you have is that you're not correctly instructing your browser to deal with UTF-8. The quick answer is to set a proper HTTP header:
header('Content-Type: text/html; charset=UTF-8');
For the longer and more nuanced answer(s), see UTF-8 all the way through, Handling Unicode Front To Back In A Web App and What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
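A short reproduction of the failure, with made-up sample JSON. mb_convert_encoding() to ISO-8859-1 is used here in place of utf8_decode, which performs the same conversion.

```php
<?php
// UTF-8 encoded JSON; the e acute is written as the bytes C3 A9.
$json = "{\"name\":\"Ren\xC3\xA9\"}";

$decoded = json_decode($json, true);
echo bin2hex($decoded['name']), "\n"; // 52656ec3a9: still UTF-8

// Converting to Latin-1 first leaves a lone 0xE9 byte, which is invalid UTF-8.
$latin1 = mb_convert_encoding($json, 'ISO-8859-1', 'UTF-8');
$broken = json_decode($latin1, true);

var_dump($broken);                               // NULL
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)
```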

utf8_encode function purpose

Suppose I'm saving my files encoded as UTF-8.
Within a PHP script, a string is compared like this:
$string="ぁ";
$string = utf8_encode($string); // Do I need this step?
if (preg_match('/ぁ/u', $string)) {
    // Do something if it matches...
}
Is that string really UTF-8 without the utf8_encode() function?
If you encode your files as UTF-8, do you not need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
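The following sketch makes the bytes visible. The character is written with byte escapes so the result does not depend on how this file is saved, and mb_convert_encoding() from ISO-8859-1 stands in for utf8_encode, which does the same conversion.

```php
<?php
// The three UTF-8 bytes of the character, exactly as PHP would hold them
// when the source file is saved as UTF-8.
$string = "\xE3\x81\x82";
echo bin2hex($string), "\n"; // e38182

// The UTF-8 string matches the UTF-8 pattern without any conversion.
echo preg_match("/\xE3\x81\x82/u", $string), "\n"; // 1

// Treating those bytes as ISO-8859-1 and converting "again" garbles them:
// each byte becomes its own two-byte UTF-8 sequence.
$garbled = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n"; // c3a3c281c282
```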
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a questionmark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
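To illustrate the difference between binary comparison and the u modifier, here is a small sketch with the character written as byte escapes (an assumption made so the example is independent of the file's encoding):

```php
<?php
$string = "\xC3\xA4"; // the letter a with diaeresis in UTF-8: two bytes

// Binary comparison: same bytes, so it matches regardless of encoding.
var_dump($string === "\xC3\xA4");        // bool(true)

// Without /u the dot matches one byte, so a two-byte string is not "one character".
var_dump(preg_match('/^.$/', $string));  // int(0)

// With /u the subject is treated as UTF-8 and the dot matches the whole character.
var_dump(preg_match('/^.$/u', $string)); // int(1)
```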
Your string is not UTF-8 encoded, so preg_match with the u modifier can't match it, which is why you need to utf8_encode it. Try saving the PHP file as UTF-8 (with something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function treats every byte of the given string as an ISO-8859-1 character and encodes it as UTF-8,
no matter what encoding was actually used to store the file.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1.- The correct use of this function is to pass it an ISO-8859-1 string.
Why? Because Unicode (in its first 256 code points) and ISO-8859-1 have the same characters at the same positions.
[Encoding] [Char][Byte value] ----> [utf8_encode output]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoding of [€]? No (that would be E2|82|AC)
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoding of [¢]? Yes
The function may seem to work with other encodings: it works if the string contains only characters that have the same
byte values as in ISO-8859-1 (e.g. in Windows-1252, positions 00-7F and A0-FF).
Keep in mind that if the function receives a string that is already UTF-8 (for example, from a file saved as UTF-8), it will encode that string again and produce garbage.
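The two rows of the table can be checked with a few lines. This is a sketch in which mb_convert_encoding() from ISO-8859-1 stands in for utf8_encode, which performs the same conversion.

```php
<?php
// Byte 0x80 is the euro sign in Windows-1252, but a control character in ISO-8859-1.
$euro1252 = "\x80";

// Treated as ISO-8859-1 (what utf8_encode assumes): not the euro sign.
echo bin2hex(mb_convert_encoding($euro1252, 'UTF-8', 'ISO-8859-1')), "\n";   // c280

// Treated as Windows-1252: the real UTF-8 encoding of the euro sign.
echo bin2hex(mb_convert_encoding($euro1252, 'UTF-8', 'Windows-1252')), "\n"; // e282ac

// Byte 0xA2 is the cent sign in both encodings, so the result is correct.
echo bin2hex(mb_convert_encoding("\xA2", 'UTF-8', 'ISO-8859-1')), "\n";      // c2a2
```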

(PHP) rawurlencode/decode seems to encode '£' sign as 'Â£' (%C2%A3 instead of %A3)

So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web-server, and we've used rawurlencode for this. This works fine with almost every character I've found, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try and refill the form boxes with the information the user had used), then when the %C2 is run through rawurldecode/encode, it becomes Ã? - aka, %C3?. And of course the "£" is also turned into another £!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percent-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
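A short sketch of why the two percent-encodings differ. The pound sign is written as UTF-8 byte escapes, and mb_convert_encoding() is swapped in for the iconv call above purely for illustration; it performs the same UTF-8 to ISO-8859-15 conversion for this character.

```php
<?php
// The pound sign as the browser submits it in UTF-8: two bytes.
$pound = "\xC2\xA3";
echo rawurlencode($pound), "\n";  // %C2%A3

// The same character converted to ISO-8859-15: one byte.
$latin9 = mb_convert_encoding($pound, 'ISO-8859-15', 'UTF-8');
echo rawurlencode($latin9), "\n"; // %A3
```

rawurlencode only escapes the bytes it is given, so the output reflects the encoding of the input string rather than a bug in the function.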
This is probably encoding the A3 character of your native character set as C2 A3, which is the valid UTF-8 encoding of the pound sign (A3 in ISO-8859-1 and Windows-1252). Either consume the encoded URL as UTF-8, or convert the string to a single-byte encoding before passing it to urlencode.
Artefacto's answer represents a case when you need to convert character encodings, for example, you are displaying a page and the page encoding is set to Latin-1. (Raw)Urlencode will produce escaped strings with multibyte character representations. (Raw)Urldecode will by default produce utf-8 encoded strings, and will represent £ as two bytes. If you display this string making a claim that it is a ISO-8859 encoded string, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.
