UTF-8 issue with PHP's json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API returns the value in a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.
Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.

This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. Such a sequence does not refer to a byte sequence in any UTF encoding; it refers to the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
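A minimal sketch of that round trip, assuming nothing beyond PHP's built-in json_encode and json_decode. The variable names are only for illustration, and the á is written as explicit UTF-8 bytes so the result does not depend on how the script file is saved:

```php
<?php
// "\xc3\xa1" is the UTF-8 byte sequence for "á" (U+00E1).
$original = array("Mill\xc3\xa1n");

$json    = json_encode($original);    // non-ASCII characters are escaped by default
$decoded = json_decode($json, true);  // the escape is turned back into UTF-8 bytes

echo $json, "\n";                     // ["Mill\u00e1n"]
echo bin2hex($decoded[0]), "\n";      // 4d696c6cc3a16e
var_dump($decoded === $original);     // bool(true)
```

The escaped form and the raw UTF-8 form are two spellings of the same JSON string, so a conforming parser on the receiving side should produce identical data from either.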

Try the following to solve the problem.
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
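Note: adding the JSON_UNESCAPED_UNICODE flag to json_encode keeps non-ASCII characters as literal UTF-8 instead of \uXXXX escapes. The flag is only available from PHP 5.4 onward, so it will not work on the PHP 5.3.24 mentioned in the question.
For Python, see "Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence".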

Related

What is the correct code to convert to hex in below format?

I am trying to send Bengali text as an SMS using our local carrier's API, but they don't support Unicode (UTF-8) text as a POST/GET parameter. They replied with this:
For every Bengali alphabet there is standard HEXDUMP representation
which need to be inserted in message content part.
Like below Bengali word is having below HEXDUMP representation
বাংলাদেশ : 09AC09BE098209B209BE09A609C709B6
So I tried the following two snippets gathered from SO.
Code-1:
$strBN = 'বাংলাদেশ';
echo bin2hex($strBN);
// it returns this value: "e0a6ace0a6bee0a682e0a6b2e0a6bee0a6a6e0a787e0a6b6"
Code-2:
$strBN = 'বাংলাদেশ';
echo fToHex($strBN);
function fToHex($string)
{
$strHData = '';
for ($i = 0; $i < strlen($string); $i++)
{
$strHData .= str_pad(dechex(ord($string[$i])), 2, '0', STR_PAD_LEFT);
}
return $strHData;
}
// This also returns the same value as above: "e0a6ace0a6bee0a682e0a6b2e0a6bee0a6a6e0a787e0a6b6"
So my question is: how can I convert that text/string to the hex dump my carrier expects?

The hex dump that you are getting is the UTF-8 encoding, which is a way to represent Unicode characters reliably in an 8-bit stream.
E0 A6 AC E0 A6 BE E0 A6 82 E0 A6 B2 E0 A6 BE E0 A6 A6 E0 A7 87 E0 A6 B6
The example on the other hand is a dump of the UTF-16 (or truncated 16-bit Unicode codepoint) values:
09AC 09BE 0982 09B2 09BE 09A6 09C7 09B6
In your case the solution is to convert to UTF-16 encoding:
echo bin2hex(mb_convert_encoding('বাংলাদেশ', 'UTF-16BE', 'UTF-8'));
> 09ac09be098209b209be09a609c709b6
Note that using Unicode characters in code is unreliable, because the interpretation of the bytes in a string will depend on your system details / editor / compiler or interpreter settings etc.
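One way to sidestep the source-file encoding concern is to build the input from the exact bytes shown in the question's own dump. The sketch below does that; utf8ToUtf16Hex is a hypothetical helper name, the mbstring extension is assumed to be available, and the upper-casing is only there to match the carrier's sample:

```php
<?php
// Hypothetical helper: hex-encode a UTF-8 string as UTF-16BE code units.
// Requires the mbstring extension.
function utf8ToUtf16Hex($utf8)
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return strtoupper(bin2hex($utf16));
}

// The UTF-8 bytes of the Bengali word, taken from the bin2hex() output in the question.
$word = pack('H*', 'e0a6ace0a6bee0a682e0a6b2e0a6bee0a6a6e0a787e0a6b6');

echo utf8ToUtf16Hex($word), "\n";  // 09AC09BE098209B209BE09A609C709B6
```

Characters outside the Basic Multilingual Plane would come out as surrogate pairs (two 16-bit units each); whether the carrier accepts those is not something the question answers.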

Real binary write PHP

How do I do something as simple as (in PHP) this code in C:
char buffer[5] = "testing";
FILE* file2 = fopen("data2.bin", "wb");
fwrite(buffer, sizeof buffer, 1, file2);
fclose(file2);
Whenever I try to write a binary file in PHP, it doesn't write in real binary.
Example:
$ptr = fopen("data2.bin", 'wb');
fwrite($ptr, "testing");
fclose($ptr);
I found on internet that I need to use pack() to do this...
What I expected:
testing\9C\00\00
or
7465 7374 696e 679c 0100 00
What I got:
testing412
Thanks

You're making the classic mistake of confusing data with the representation of that data.
Let's say you have a text file. If you open it in Notepad, you'll see the following:
hello
world
This is because Notepad assumes the data is ASCII text. So it takes every byte of raw data, interprets it as an ASCII character, and renders that text to your screen.
Now if you go and open that file with a hex editor, you'll see something entirely different [1]:
68 65 6c 6c 6f 0d 0a 77 6f 72 6c 64    hello..world
That is because the hex editor instead takes every byte of the raw data, and displays it as a two-character hexadecimal number.
[1] Assuming Windows \r\n line endings and ASCII encoding.
So if you're expecting hexadecimal ASCII output, you need to convert your string to its hexadecimal encoding before writing it (as ASCII text!) to the file.
In PHP, what you're looking for is the bin2hex function which "Returns an ASCII string containing the hexadecimal representation of str." For example:
$str = "Hello world!";
echo bin2hex($str); // output: 48656c6c6f20776f726c6421
Note that the "wb" mode argument doesn't cause any special behavior. It guarantees binary output, not hexadecimal output. I cannot stress enough that there is a difference. All the b really does is guarantee that line endings will not be converted by the library when reading or writing data.
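The question also mentions pack(). The bytes the asker expected (9c 01 00 00 after "testing") look like the integer 412 stored as a 32-bit little-endian value, so here is a sketch under that assumption. The temporary file and variable names are illustrative only:

```php
<?php
// Assumption: the goal is "testing" followed by the integer 412 as a
// 32-bit little-endian value (0x0000019C -> 9c 01 00 00).
$payload = "testing" . pack('V', 412);  // 'V' = unsigned 32-bit, little-endian

$path = tempnam(sys_get_temp_dir(), 'bin');  // throwaway file for the demo
$fh = fopen($path, 'wb');
fwrite($fh, $payload);
fclose($fh);

$written = file_get_contents($path);
unlink($path);

echo strlen($written), "\n";   // 11
echo bin2hex($written), "\n";  // 74657374696e679c010000
```

Concatenating the number directly, as in "testing" . 412, writes the three ASCII digits 4, 1, 2 instead, which matches the "testing412" the asker got.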

PHP gettext and non-ASCII characters

I have a PHP web application that was originally written in Polish, and I was asked to localize it into Russian. I've decided to use gettext, but I have a problem when I try to translate strings with Polish special characters. For example:
echo gettext('Urządzenie');
This displays "Urządzenie" in the web browser instead of the Russian word.
All files are encoded in UTF-8 and the .po file was generated with --from-code utf-8. Translations without Polish special characters, such as
echo gettext('Instrukcja');
work well. Do you know what could be the reason for this strange behaviour?

Are you sure the PHP file is in UTF-8 format? To verify, try this:
echo bin2hex('Urządzenie');
You should see the following bytes:
55 72 7a c4 85 64 7a 65 6e 69 65
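To make that comparison easier to read, the check can be sketched as follows. hexBytes is a hypothetical helper, and the ISO-8859-2 string is included only as an example of what a non-UTF-8 file might contain; the real file could be in some other encoding:

```php
<?php
// Hypothetical helper: format a string's bytes as space-separated hex pairs.
function hexBytes($s)
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// "\xc4\x85" is the UTF-8 encoding of "ą" (U+0105); in ISO-8859-2 the same
// letter is the single byte 0xb1.
$utf8   = "Urz\xc4\x85dzenie";
$latin2 = "Urz\xb1dzenie";

echo hexBytes($utf8), "\n";    // 55 72 7a c4 85 64 7a 65 6e 69 65
echo hexBytes($latin2), "\n";  // 55 72 7a b1 64 7a 65 6e 69 65

// preg_match() with the /u modifier fails on input that is not valid UTF-8.
var_dump(preg_match('//u', $utf8) === 1);    // bool(true)
var_dump(preg_match('//u', $latin2) === 1);  // bool(false)
```

If the literal in the PHP file produces anything other than c4 85 for the ą, the msgid passed to gettext will not match the UTF-8 msgid in the catalog, which would explain why only strings with special characters fail.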

htmlentities, htmlspecialchars, and "invalid multibyte sequence"

This question tells me
htmlentities is identical to htmlspecialchars() in all ways, except with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities.
Sounds like htmlentities is the one I want.
Then this question tells me I need the "UTF-8" argument to get rid of this error:
Invalid multibyte sequence in argument
So, here is my encoding wrapper function (to normalise behaviour across different PHP versions)
function html_entities ($s)
{
return htmlentities ($s, ENT_COMPAT /* ENT_HTML401 */, "UTF-8");
}
I am still getting the "multibyte sequence in argument" error.
Here is a sample string which triggers the error, and its hex encoding:
Jigue à Baptiste
4a 69 67 75 65 20 e0 20 - 42 61 70 74 69 73 74 65
I notice that the à is encoded as 0xe0 but as a single byte which is above 0x80.
What am I doing wrong?

Your string is encoded in ISO-8859-1, not UTF-8. Plain and simple.
function html_entities ($s)
{
return htmlentities ($s, ENT_COMPAT /* ENT_HTML401 */, "ISO-8859-1");
^^^^^^^^^^^^
}

If à is encoded as 0xE0 then you didn't save the file in UTF-8 encoding. 0xE0 is invalid UTF-8. It should be 0xC3 0xA0
Save your file in UTF-8 encoding. Also see UTF-8 all the way through
If you saved it correctly in utf-8, the hex should look like so:
4A 69 67 75 65 20 C3 A0 20 42 61 70 74 69 73 74 65
J i g u e à B a p t i s t e
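If the input can arrive in either encoding, the two answers can be combined. The following is only a sketch: html_entities_any is a hypothetical name, the fallback to ISO-8859-1 is an assumption that holds only if Latin-1 really is the other encoding in play, and the mbstring extension is assumed to be available:

```php
<?php
// Hypothetical wrapper: treat input that is not valid UTF-8 as ISO-8859-1,
// convert it, then encode entities as UTF-8. Requires the mbstring extension.
function html_entities_any($s)
{
    // preg_match() with /u does not return 1 when $s is invalid UTF-8.
    if (preg_match('//u', $s) !== 1) {
        $s = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
    }
    return htmlentities($s, ENT_COMPAT, 'UTF-8');
}

echo html_entities_any("Jigue \xe0 Baptiste"), "\n";      // Jigue &agrave; Baptiste
echo html_entities_any("Jigue \xc3\xa0 Baptiste"), "\n";  // Jigue &agrave; Baptiste
```

Guessing the encoding this way is a stopgap; knowing the encoding at the source, and keeping it UTF-8 throughout, is the more reliable fix.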

Strange characters in PHP

This is driving me crazy.
I have this one php file on a test server at work which does not work.. I kept deleting stuff from it till it became
<?
print 'Hello';
?>
it outputs
ï»¿Hello
if I create a new file and copy / paste the same script to it it works!
Why does this one file give me the strange characters all the time?

That's the BOM (Byte Order Mark) you are seeing.
In your editor, there should be a way to force saving without BOM which will remove the problem.

Found it: file -> encoding -> UTF8 with BOM, changed it to UTF8 :-)
I should have asked before wasting time trying to figure it out :-)

Just in case, here is a list of bytes for BOM
Encoding      Representation (hexadecimal)
UTF-8         EF BB BF
UTF-16 (BE)   FE FF
UTF-16 (LE)   FF FE
UTF-32 (BE)   00 00 FE FF
UTF-32 (LE)   FF FE 00 00
UTF-7         2B 2F 76, and one of the following bytes: [ 38 | 39 | 2B | 2F ]
UTF-1         F7 64 4C
UTF-EBCDIC    DD 73 66 73
SCSU          0E FE FF
BOCU-1        FB EE 28 optionally followed by FF
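The ï»¿ in the question is what the three UTF-8 BOM bytes EF BB BF look like when displayed as Latin-1 or Windows-1252. When the file cannot simply be re-saved (for example, data read from elsewhere), the BOM can be removed in code. A minimal sketch, with stripUtf8Bom as a hypothetical helper name:

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) if there is one.
function stripUtf8Bom($s)
{
    if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($s, 3);
    }
    return $s;
}

echo bin2hex(stripUtf8Bom("\xEF\xBB\xBFHello")), "\n";  // 48656c6c6f
echo stripUtf8Bom("Hello"), "\n";                        // Hello
```

This only helps for data handled at runtime. A BOM at the top of a PHP script itself is sent as output before any code runs, so re-saving the file without the BOM remains the fix for the case in the question.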
