Real binary write PHP - php

How do I do something as simple as (in PHP) this code in C:
char buffer[] = "testing"; /* sized by the initializer: 7 characters plus the terminating NUL */
FILE* file2 = fopen("data2.bin", "wb");
fwrite(buffer, sizeof buffer, 1, file2);
fclose(file2);
Whenever I try to write a binary file in PHP, it doesn't write in real binary.
Example:
$ptr = fopen("data2.bin", 'wb');
fwrite($ptr, "testing");
fclose($ptr);
I found on the internet that I need to use pack() to do this...
What I expected:
testing\9C\00\00
or
7465 7374 696e 679c 0100 00
What I got:
testing412
Thanks

You're making the classic mistake of confusing data with the representation of that data.
Let's say you have a text file. If you open it in Notepad, you'll see the following:
hello
world
This is because Notepad assumes the data is ASCII text. So it takes every byte of raw data, interprets it as an ASCII character, and renders that text to your screen.
Now if you open that file with a hex editor, you'll see something entirely different [1]:
68 65 6c 6c 6f 0d 0a 77 6f 72 6c 64 hello..world
That is because the hex editor instead takes every byte of the raw data and displays it as a two-character hexadecimal number.
[1] Assuming Windows \r\n line endings and ASCII encoding.
So if you're expecting hexadecimal ASCII output, you need to convert your string to its hexadecimal encoding before writing it (as ASCII text!) to the file.
In PHP, what you're looking for is the bin2hex function which "Returns an ASCII string containing the hexadecimal representation of str." For example:
$str = "Hello world!";
echo bin2hex($str); // output: 48656c6c6f20776f726c6421
Note that the "wb" mode argument doesn't cause any special behavior. It guarantees binary output, not hexadecimal output. I cannot stress enough that there is a difference. The only thing the b really does is guarantee that line endings will not be converted by the library when reading or writing data.
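The expected bytes in the question (9c 01 00 00 after "testing") look like the number 412 stored as a 32-bit little-endian integer. A minimal sketch under that assumption, using pack() with the 'V' format code and bin2hex() to inspect the result; the scratch file path is hypothetical:

```php
<?php
// Assumption: the trailing number (412) is meant to be stored as a
// 4-byte little-endian unsigned integer, hence the 'V' format code.
$data = "testing" . pack('V', 412);

// Hypothetical scratch file; any writable path works.
$path = tempnam(sys_get_temp_dir(), 'bin');

$ptr = fopen($path, 'wb');
fwrite($ptr, $data);
fclose($ptr);

// Inspect the raw bytes as hexadecimal text instead of echoing them.
$hex = bin2hex(file_get_contents($path));
echo $hex, "\n"; // 74657374696e679c010000

unlink($path);
```

The file itself holds 11 raw bytes; the hexadecimal string only exists in the echo, which is the data versus representation distinction made above.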

Related

PHP pack and unpack feature

When I run something like pack('N', "123455") or any variation of the 'N' option, I always get a character returned. The above example returns �?.
I am trying to work with Clamd and streaming to the socket and it needs "4 bytes unsigned integer in network byte order". I simply cannot get it to work.
Echoing binary data will pretty much always output something that looks like that. Binary data is not meant to be read and understood by humans.
$binary = pack('N', "123455");
$hex = bin2hex($binary);
echo $hex;
// 0001e23f
Your pack() call properly returns the binary data 00 01 e2 3f which is a 4-byte big-endian representation of the number 123455. For a number, you can verify this by converting the number to hexadecimal (echo dechex(123455); => 1e23f) and prepending zeroes until you reach 4 bytes (8 hexadecimal characters, 0001e23f).
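Echoing the binary data sends the four raw bytes 00 01 e2 3f to the output, and whatever displays them tries to interpret them as text. 0x00 and 0x01 are non-printing control characters. 0x3f is the ASCII code for "?", so that byte shows up as a literal question mark. 0xe2 is not a complete character on its own in UTF-8 (it starts a three-byte sequence, and 0x3f is not a valid continuation byte), so a UTF-8 terminal or browser typically renders it as "�". Together that gives the "�?" you see.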
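To check the packed value without echoing raw bytes, a short sketch along these lines may help. The $payload string is a hypothetical stand-in for the data you would stream, and the length prefix simply follows the "4 bytes unsigned integer in network byte order" wording from the question:

```php
<?php
// Pack 123455 as a 4-byte unsigned integer in network (big-endian) order.
$binary = pack('N', 123455);
echo strlen($binary), "\n";   // 4
echo bin2hex($binary), "\n";  // 0001e23f

// unpack() reverses the operation; unnamed fields start at index 1.
$decoded = unpack('N', $binary);
echo $decoded[1], "\n";       // 123455

// Hypothetical payload, prefixed with its length as a 4-byte big-endian integer.
$payload = "example data";
$chunk = pack('N', strlen($payload)) . $payload;
echo bin2hex(substr($chunk, 0, 4)), "\n"; // 0000000c
```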

Charset of text file stored with file_put_contents() is misinterpreted

To prepare a download of some HTML contenteditable as a plain text file, I do the following:
Send the HTML contenteditable, which contains other HTML elements, through Ajax to a server-side script prepareDownload.php.
There I create a new DOMDocument : $doc = new DOMDocument();
Then I do : $doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
Then I am looking for text contents in certain elements and assemble it in $plainText
Finally I write $plainText to disk with : file_put_contents($txtFile, $plainText, LOCK_EX);
So far it works, but when I open the text file, special characters like the German Ä are a mess.
To find out where the problem might come from, I placed some print_r() calls at several stages in the PHP script and looked in the browser's console at what comes back.
Up to the point where I write $plainText to disk with file_put_contents(), everything is perfect. In the stored text file, however, the characters are a mess.
Now I assume that file_put_contents() misinterprets the given charset. But how do I tell file_put_contents() that it should interpret (not encode) it as UTF-8?
EDIT:
As a test to find out more, I replaced the explicit statement:
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"])
with
$doc->loadHTML($_POST["data"])
The character ä in the file still looks weird, but different. The hexdump now looks like this:
0220: 20 76 69 65 6C 2C 20 65 72 7A C3 A4 68 6C 74 20 viel, erz..hlt
Now ä shows up as two dots (two bytes) and is hex C3 A4. What kind of encoding is this?
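For reference, C3 A4 is the two-byte UTF-8 encoding of ä (U+00E4), and file_put_contents() writes the bytes it is given without any charset conversion. A minimal sketch that checks both points, using a hypothetical scratch file; if the stored bytes are already C3 A4, the garbling most likely comes from the program displaying the file assuming a different charset, which is an assumption this sketch cannot verify:

```php
<?php
// "\xC3\xA4" spells out the two bytes from the hexdump explicitly.
$text = "erz\xC3\xA4hlt";

// Hypothetical scratch file; any writable path works.
$path = tempnam(sys_get_temp_dir(), 'txt');
file_put_contents($path, $text, LOCK_EX);
$stored = file_get_contents($path);
unlink($path);

// The bytes on disk are exactly the bytes of the string.
echo bin2hex($stored), "\n"; // 65727ac3a4686c74
var_dump($stored === $text); // bool(true)
```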

printf() Extended Unicode Characters?

$formatthis = 219;
$printthis = 98;
// %c - the argument is treated as an integer, and presented as the character with that ASCII value.
$string = 'There are %c treated as integer %c';
echo printf($string, $formatthis, $printthis);
I'm attempting to understand printf().
I don't quite understand the parameters.
I can see that the first parameter seems to be the string that the formatting will be applied to.
The second is the first variable to format, and the third seems to be the second variable to format.
What I don't understand is how to get it to print unicode characters that are special.
E.G. Beyond a-z, A-Z, !##$%^&*(){}" ETC.
Also, why does it output the location of the last quote in the string?
OUTPUT:
There are � treated as integer �32
How could I encode this into UTF-16 (decimal)? For example, the snowman is 9,731 in decimal UTF-16.
UTF-8 'LATIN CAPITAL LETTER A' (U+0041) = 41, but if I write 41 in PHP I get ')'. I googled an ASCII table and it shows that the number for A is 065...
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
If it's already in UTF-8, why are those two numbers different? Also, the output is different.
EDIT: Okay, so the chart I'm looking at is obviously displaying the digits as hex values, which I didn't immediately notice; 41 in hex is ASCII 065.
%c is basically an int2bin function, meaning it formats a number into its binary representation. This goes up to the decimal number 255, which will be output as the byte 0xFF.
To output, say, the snowman character ☃, you'd need to output the exact bytes necessary to represent it in your encoding of choice. If you chose UTF-8 to encode it, the necessary bytes are E2 98 83:
printf('%c%c%c', 226, 152, 131); // ☃
// or
printf('%c%c%c', 0xE2, 0x98, 0x83); // ☃
The problem in your case is 1) that the bytes you're outputting don't mean anything in the encoding you're interpreting the result as (a byte such as 219, or 0xDB, is not a valid character on its own in UTF-8, which is why you're seeing a "�") and 2) that you're echoing the result of printf, which outputs 32 (printf returns the number of bytes it output).
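Both points can be seen in a small sketch: sprintf() builds the same bytes as a string without printing them, and the return value of printf() is a byte count rather than text. The variable names are only illustrative:

```php
<?php
// Build the three UTF-8 bytes of the snowman without printing them.
$snowman = sprintf('%c%c%c', 0xE2, 0x98, 0x83);
echo bin2hex($snowman), "\n"; // e29883

// printf() prints the string and returns the number of bytes written:
// three bytes for the snowman plus one for the newline.
$count = printf("%s\n", $snowman);
echo $count, "\n"; // 4
```

Dropping the echo in front of printf in the question removes the stray 32 from the output.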

UTF-8 issue with PHP's json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API returns a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.
Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. These sequences do not refer to any physical byte encoding in any UTF encoding; they refer to the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
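A quick round trip illustrates this. In this sketch the á is written as explicit UTF-8 bytes so the result does not depend on how the script file itself is saved:

```php
<?php
// "Millán" with the á spelled out as its UTF-8 bytes (c3 a1).
$val = array("Mill\xC3\xA1n");

// json_encode escapes the non-ASCII character by code point.
$json = json_encode($val);
echo $json, "\n"; // ["Mill\u00e1n"]

// Decoding restores the original UTF-8 bytes.
$back = json_decode($json, true);
var_dump($back[0] === $val[0]); // bool(true)
```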
Try the code below:
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
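Note: pass the JSON_UNESCAPED_UNICODE flag to json_encode to keep the original characters instead of \u escapes. This flag requires PHP 5.4 or later, so it is not available on the PHP 5.3.24 mentioned in the question.
For Python, see "Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence".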

Convert two string to the same byte length

I have 2 strings in my PHP code, 1 is a parameter to my method and 1 is a string from an ini file.
The problem is that they are not equal, although they have the same content, probably due to encoding issues. When using var_dump, the first string's length is reported as 23 and the second string's length as 47 (see the end of my question for the reason behind this).
How can I make sure they are both encoded the same way and have the same length in the end, so the comparison won't fail? Preferably, I would like them to be UTF-8 encoded.
For reference, this is an excerpt from the code:
static function getString($keyword, $file) {
    $lang_handle = parse_ini_file($file, true);
    var_dump($keyword);
    foreach ($lang_handle as $key => $value) {
        var_dump($key);
        if ($key == $keyword) {
            foreach ($value as $subkey => $subvalue) {
                var_dump("\t" . $subkey . " => " . $subvalue);
            }
        }
    }
}
with the following ini:
[clientcockpit/login.php]
header = "Kunden Login"
username = "Benutzername"
password = "Passwort"
forgot = "Passwort vergessen"
login = "Login"
When calling the method with getString("clientcockpit/login.php", "inifile.ini") the output is:
string 'clientcockpit/login.php' (length=23)
string '�c�l�i�e�n�t�c�o�c�k�p�i�t�/�l�o�g�i�n�.�p�h�p�' (length=47)
Your INI file seems to be in UTF-16 encoding or similar, using two bytes to represent a single character. I guess that the strange characters in your string are actually NULL bytes (\0).
PHP's Unicode support is quite poor and I guess that parse_ini_file() does not support multibyte encodings properly. It will treat the file as if it were encoded using an "ASCII-compatible" single-byte encoding, just looking for the special characters [ and ] to detect sections. As a result, the section keys will be corrupted: one byte that actually belongs to [ or ] will become part of the section key:
UTF-16: [c] (3 characters, 6 bytes)
For UTF-16BE (big endian):
Bytes: 00 5B 00 63 00 5D (6 bytes)
ASCII: \0 [ \0 c \0 ] (6 characters)
For UTF-16LE (little endian):
Bytes: 5B 00 63 00 5D 00 (6 bytes)
ASCII: [ \0 c \0 ] \0 (6 characters)
Assuming ASCII, instead of reading c, parse_ini_file() will read \0c\0 if the source file encoding is UTF-16.
If you can control the format of your INI file, make sure to save it in UTF-8 or ISO-8859-1 encoding, using your favorite text editor.
Otherwise you will have to read in the file contents using file_get_contents(), do the encoding conversion (e.g. using iconv()) and pass the result to parse_ini_string(). The drawback here is that you will have to detect or hardcode the original file encoding.
If the mb multibyte extension is available on your PHP installation, you can use mb_detect_encoding() and mb_convert_encoding() to do the conversion dynamically.
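The conversion approach could look roughly like the sketch below. It assumes the iconv extension is available and hardcodes UTF-16LE as the source encoding (no BOM handling); the in-memory sample is a hypothetical stand-in for the result of file_get_contents($file):

```php
<?php
// Hypothetical sample: build a UTF-16LE INI in memory to stand in for the file.
$utf8Source = "[clientcockpit/login.php]\nheader = \"Kunden Login\"\n";
$raw = iconv('UTF-8', 'UTF-16LE', $utf8Source);

// Each ASCII character now takes two bytes, which is what confuses the parser.
var_dump(strlen($raw) === 2 * strlen($utf8Source)); // bool(true)

// Assumption: the source encoding is known to be UTF-16LE.
$converted = iconv('UTF-16LE', 'UTF-8', $raw);
$ini = parse_ini_string($converted, true);

// The section key is now a plain 23-byte string again.
echo $ini['clientcockpit/login.php']['header'], "\n"; // Kunden Login
```

With the keys converted, a comparison such as $key == $keyword in the question's loop sees identical byte strings.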
Try this:
$lang_handle = parse_ini_string(file_get_contents($file), true);
