Charset of text file stored with file_put_contents() is misinterpreted - php

To prepare a download of an HTML contenteditable region as a plain text file, I do the following:
Send the HTML of the contenteditable, which contains other HTML elements, through Ajax to a server-side script, prepareDownload.php.
There I create a new DOMDocument: $doc = new DOMDocument();
Then I do: $doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
Then I look for the text content of certain elements and assemble it in $plainText.
Finally I write $plainText to disk with: file_put_contents($txtFile, $plainText, LOCK_EX);
So far it works … but when I open the text file, special characters like the German Ä are a mess.
To find out where the problem might be introduced, I placed print_r() calls at several stages in the PHP script and looked in the browser's console at what comes back.
Up to the point where I write $plainText to disk with file_put_contents(), everything is perfect. Looking into the stored text file afterwards, the characters are a mess.
Now I assume that file_put_contents() misinterprets the given charset. But how do I tell file_put_contents() that it should interpret (not encode) the data as UTF-8?
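For reference, the pipeline described above looks roughly like this (a sketch; the XPath query and the output file name are placeholders, since the post only mentions "certain elements"):
<?php
// prepareDownload.php – sketch of the steps described above
$doc = new DOMDocument();
// the XML declaration tells libxml to treat the posted markup as UTF-8
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
$xpath = new DOMXPath($doc);
$plainText = '';
foreach ($xpath->query('//p') as $node) {   // placeholder for "certain elements"
    $plainText .= $node->textContent . "\n";
}
// file_put_contents() writes the bytes of $plainText to disk verbatim
file_put_contents('download.txt', $plainText, LOCK_EX);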
EDIT:
As a test to find out more, I replaced the explicit statement:
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"])
with
$doc->loadHTML($_POST["data"])
The character ä in the file still looks weird, but different. The hexdump now looks like this:
0220: 20 76 69 65 6C 2C 20 65 72 7A C3 A4 68 6C 74 20 viel, erz..hlt
Now ä shows up as two dots (two bytes), hex C3 A4. What kind of encoding is this?
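For what it's worth, C3 A4 is exactly the UTF-8 byte sequence for ä, and file_put_contents() writes whatever bytes it is given without reinterpreting them; a quick check along these lines (not from the original post) makes that visible:
<?php
$s = 'erzählt';                       // this script is saved as UTF-8
echo bin2hex($s), "\n";               // 65727ac3a4686c74 – c3a4 is ä in UTF-8
var_dump(mb_check_encoding($s, 'UTF-8'));   // bool(true)
file_put_contents('check.txt', $s);   // the file contains exactly these bytes
If the bytes on disk are correct but the file still looks wrong, the program used to open it is most likely interpreting them in another charset (for example ISO-8859-1, where C3 A4 shows up as two separate characters).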

Related

PHP Problems with multibyte strings and using DOMDOCUMENT

Using PHP 5.6.11, I have a block of HTML that is UTF-8 encoded; the text contains multibyte strings.
Here is one sample of a string:
"You haven’t added"
Viewed with hexdump (see e2 80 99?) on a utf-8 console (linux):
00000000 59 6f 75 20 68 61 76 65 6e e2 80 99 74 20 61 64 |You haven...t ad|
Here it is as html entities:
"You haven’t added"
All this is ok. However when I load it into a domdoc, it comes out again mangled (shown as html entities).
"You haven&acirc;&#128;&#153;t added"
Here is the code to generate this snippet.
$text="<html><body>You haven’t added anything.<br></body></html>";
echo mb_detect_encoding($text)."\n";
$text2= substr($text,strpos($text,"You haven"),20);
echo $text2."\n";
echo htmlentities($text2);
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML($text);
$text2 = $doc->saveHTML();
$text2= substr($text2,strpos($text2,"You haven"),35);
echo "\n".htmlentities($text2)."\n";
The output of this is:
UTF-8
You haven’t added
You haven’t added
You haven&acirc;&#128;&#153;t added
I have tried a variety of ideas, but I can't seem to keep DOMDocument from mangling either the HTML or the multibyte characters. Any suggestions?
Edit: If I insert a meta tag it works more as expected.
$text='<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head><body>You haven’t added anything.<br></body></html>';
Output:
UTF-8
You haven’t added
You haven’t added
You haven’t added anything.<br></
Edit 2:
Inserting the meta tag with charset=utf-8 works fine, and so does:
$doc->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));
which also fixes the encoding. I still can't figure out what DOMDocument is doing with the encoding; I've tried this line at least three times earlier and it wasn't working. Perhaps a little time away from the keyboard was needed, because it seems to be working now. I'll update this if there is a problem once I test it on bigger datasets.
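Condensed, the two fixes from the edits above look like this (a sketch using the same $text as before):
<?php
// Fix 1: declare the charset in the markup itself (meta http-equiv)
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head>'
    . '<body>You haven’t added anything.<br></body></html>');

// Fix 2: convert multibyte characters to entities before loading
$text = "<html><body>You haven’t added anything.<br></body></html>";
$doc2 = new DOMDocument('1.0', 'utf-8');
$doc2->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'));

echo $doc2->saveHTML();   // the apostrophe comes back intact, no &acirc;&#128;&#153;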

UTF-8 issue with PHP's json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running in to is that the API returns a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.
Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences \uXXXX. These sequences do not reference any physical byte encoding in any UTF encoding; they reference the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
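A quick round trip (a sketch, not part of the original answer) shows that nothing is lost:
<?php
$json = json_encode(array("Millán"));   // ["Mill\u00e1n"]  (script saved as UTF-8)
$back = json_decode($json, true);
var_dump($back[0] === "Millán");        // bool(true)
echo bin2hex($back[0]), "\n";           // 4d696c6cc3a16e – c3a1 is á in UTF-8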
Try the code below to solve the problem:
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
Note: pass the JSON_UNESCAPED_UNICODE option to json_encode to keep the original characters instead of \u escapes.
For Python, see: Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence

Python xml-rpc with PHP client and unicode not working

I have an XML-RPC server written in Python. It takes some values and saves them in a MySQL database. The data are in UTF-8 and the whole process works fine.
I have no problem talking to it from other languages like Python, ASP.NET and C#, but when it comes to PHP there is a problem: the characters are not saved in MySQL as they should be; they all come out scrambled.
I have followed all the recommendations, such as setting the header in the PHP file, and I have also configured the MySQL collation to UTF-8, but the problem still exists.
The cURL library used is from GitHub: https://github.com/dongsheng/cURL
Source code is below:
<?php
error_reporting(E_ALL);
header('Content-Type: text/plain; charset=utf-8');
require_once('curl.php');
$rpc = "http://xmlrpc-webservice-address.com/";
$client = new xmlrpc_client($rpc, true);
$text="سلام";
$arr1=array("username", "password", array("111"), $text, "30002240123456", "ws", False);
$resp = $client->call('send', $arr1);
print_r($resp);
print_r("\n");
print_r($text);
?>
Most likely your problem is one of the following (in descending order of likelihood):
Your PHP source code isn't actually encoded as UTF-8, but as, say, CP1256. That means the non-ASCII literal string in the source is actually mojibake nonsense as far as the PHP interpreter, which reads it as UTF-8, is concerned. And those garbage bytes get passed through as-is all the way through the chain—to the XML-RPC service and back, to the browser, and to the user's screen.
Your PHP source code is encoded as UTF-8, but your PHP interpreter thinks it's, say, CP1256, because of the way it (or your server/module) is configured. So, once again, the literal string is mojibake (in the opposite direction), which again passes through the whole chain.
Your web service isn't returning UTF-8, but, say, Latin-1, and your other clients all treat it accordingly as Latin-1, but your PHP client code just assumes it's UTF-8, passes it to the browser as if it were UTF-8, and the user sees garbage.
If you're not absolutely, positively sure that your editor saved the source code as UTF-8, look at the source file in a hex editor. If it's UTF-8, the Arabic string should look like D8 B3 D9 84 D8 A7 D9 85. If it's anything different, like D3 E1 C7 E3 (CP1256) or D3 E4 C7 E5 (ISO-8859-6), that's your problem.
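If no hex editor is at hand, the same check can be done from PHP itself (a sketch; it assumes the literal sits in the same source file):
<?php
$text = "سلام";
echo bin2hex($text), "\n";                    // d8b3d984d8a7d985 if the file is UTF-8
var_dump(mb_check_encoding($text, 'UTF-8'));  // false means the file was saved in another charset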

Real binary write PHP

How do I do something in PHP as simple as this code in C:
char buffer[] = "testing";
FILE* file2 = fopen("data2.bin", "wb");
fwrite(buffer, sizeof buffer, 1, file2);
fclose(file2);
Whenever I try to write a binary file in PHP, it doesn't write in real binary.
Example:
$ptr = fopen("data2.bin", 'wb');
fwrite($ptr, "testing");
fclose($ptr);
I found on the internet that I need to use pack() to do this...
What I expected:
testing\9C\00\00
or
7465 7374 696e 679c 0100 00
What I got:
testing412
Thanks
You're making the classic mistake of confusing data with the representation of that data.
Let's say you have a text file. If you open it in Notepad, you'll see the following:
hello
world
This is because Notepad assumes the data is ASCII text. So it takes every byte of raw data, interprets it as an ASCII character, and renders that text to your screen.
Now if you go and open that file with a hex editor, you'll see something entirely different1:
68 65 6c 6c 6f 0d 0a 77 6f 72 6c 64 hello..world
That is because the hex editor instead takes every byte of the raw data, and displays it as a two-character hexadecimal number.
1 - Assuming Windows \r\n line endings and ASCII encoding.
So if you're expecting hexadecimal ASCII output, you need to convert your string to its hexadecimal encoding before writing it (as ASCII text!) to the file.
In PHP, what you're looking for is the bin2hex function which "Returns an ASCII string containing the hexadecimal representation of str." For example:
$str = "Hello world!";
echo bin2hex($str); // output: 48656c6c6f20776f726c6421
Note that the "wb" mode argument doesn't cause any special behavior. It guarantees binary output, not hexadecimal output. I cannot stress enough that there is a difference. The only thing the b really does, is guarantee that line endings will not be converted by the library when reading/writing data.
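If the goal really is to reproduce the raw bytes from the expected dump above (74 65 73 74 69 6e 67 9c 01 00 00, i.e. "testing" followed by 412 as a little-endian 32-bit integer), pack() is the usual tool; a sketch:
<?php
$fp = fopen('data2.bin', 'wb');
// 'V' packs an unsigned 32-bit integer in little-endian byte order
fwrite($fp, 'testing' . pack('V', 412));   // bytes: 74 65 73 74 69 6e 67 9c 01 00 00
fclose($fp);
// 'testing' . 412 would instead append the ASCII digits "412",
// which is exactly the "testing412" result shown in the question.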

PHP gettext and non-ASCII characters

I have a PHP web application which is originally in Polish, but I was asked to localize it into Russian. I've decided to use gettext, but I have a problem when trying to translate strings with Polish special characters. For example:
echo gettext('Urządzenie');
This displays "Urządzenie" in the web browser instead of the word in Russian.
All files are encoded in UTF-8 and the .po file was generated with --from-code utf-8. Translations without Polish special characters, such as
echo gettext('Instrukcja');
work well. Do you know what could be the reason for this strange behaviour?
Are you sure the PHP file is in UTF-8 format? To verify, try this:
echo bin2hex('Urządzenie');
You should see the following bytes:
55 72 7a c4 85 64 7a 65 6e 69 65
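For completeness, a typical UTF-8 gettext setup looks roughly like this (a sketch; the domain name, locale and directory are placeholders, not from the original post). bind_textdomain_codeset() is the usual way to force gettext to return UTF-8, independent of the runtime locale:
<?php
$locale = 'ru_RU.UTF-8';
putenv('LC_ALL=' . $locale);
setlocale(LC_ALL, $locale);

// expects ./locale/ru_RU.UTF-8/LC_MESSAGES/messages.mo (path layout may vary)
bindtextdomain('messages', __DIR__ . '/locale');
bind_textdomain_codeset('messages', 'UTF-8');
textdomain('messages');

header('Content-Type: text/html; charset=utf-8');
echo gettext('Urządzenie');   // should now print the Russian translation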
