When I run something like pack('N', "123455") or any variation of the 'N' option, I always get a character returned. The above example returns �?.
I am trying to work with Clamd and streaming to the socket and it needs "4 bytes unsigned integer in network byte order". I simply cannot get it to work.
echo'ing binary data will pretty much always output something that looks like that. Binary data is not meant to be read and understood by humans.
$binary = pack('N', "123455");
$hex = bin2hex($binary);
echo $hex;
// 0001e23f
Your pack() call properly returns the binary data 00 01 e2 3f, which is the 4-byte big-endian representation of the number 123455. You can verify this by converting the number to hexadecimal (echo dechex(123455); => 1e23f) and prepending zeroes until you reach 4 bytes (8 hexadecimal characters: 0001e23f).
Echo'ing the binary data will make PHP treat it as a string, with 00 01 and e2 3f as the characters. 0x0001 is a control character (rendered as "�") and 0xe23f does not exist as a predefined character (it falls in the Private Use Area of the Unicode standard), so it will render as "?".
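For the clamd use case, the length prefix from pack('N', ...) is written to the socket as-is, followed by the chunk bytes. Below is a rough sketch of the INSTREAM exchange; the host, port and file name are assumptions, and error handling is omitted:
// Hypothetical clamd INSTREAM sketch: each chunk is prefixed with its length
// as a 4-byte unsigned integer in network byte order, and a zero-length
// chunk terminates the stream. Adjust the address to your clamd setup.
$sock = stream_socket_client('tcp://127.0.0.1:3310', $errno, $errstr, 5);
fwrite($sock, "zINSTREAM\0");
$chunk = file_get_contents('file-to-scan.bin');
fwrite($sock, pack('N', strlen($chunk)) . $chunk); // length prefix + data
fwrite($sock, pack('N', 0));                       // end of stream
echo fread($sock, 4096);                           // e.g. "stream: OK"
fclose($sock);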
Related
$formatthis = 219;
$printthis = 98;
// %c - the argument is treated as an integer and presented as the character with that ASCII value.
$string = 'There are %c treated as integer %c';
echo printf($string, $formatthis, $printthis);
I'm attempting to understand printf().
I don't quite understand the parameters.
I can see that the first parameter seems to be the string that the formatting will be applied to.
The second is the first variable to format, and the third seems to be the second variable to format.
What I don't understand is how to get it to print unicode characters that are special.
E.g. beyond a-z, A-Z, !##$%^&*(){}", etc.
Also, why does it output "32" at the location of the last quote in the string?
OUTPUT:
There are � treated as integer �32
How could I encode this into UTF-16 (dec)? // Snowman = 9,731 dec in UTF-16
In UTF-8, 'LATIN CAPITAL LETTER A' (U+0041) = 41, but if I write 41 in PHP I get ')'. I googled an ASCII table and it shows that the number for A is 065...
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
If it's already in UTF-8, why are those two numbers different? Also the outputs are different.
EDIT: Okay, so the chart I'm looking at is obviously displaying the values in hex, which I didn't immediately notice; 41 in hex is ASCII 065 in decimal.
%c is basically the same as chr(): it takes an integer and outputs the single byte with that value. This goes up to the decimal number 255, which will be output as the byte 0xFF.
To output, say, the snowman character ☃, you'd need to output the exact bytes necessary to represent it in your encoding of choice. If you chose UTF-8 to encode it, the necessary bytes are E2 98 83:
printf('%c%c%c', 226, 152, 131); // ☃
// or
printf('%c%c%c', 0xE2, 0x98, 0x83); // ☃
The problem in your case is 1) that the bytes you're outputting don't mean anything in the encoding you're interpreting the result as (the byte for 219 doesn't form a valid UTF-8 sequence on its own, which is why you're seeing a "�") and 2) that you're echoing the return value of printf, which is why a 32 appears at the end (printf returns the number of bytes it output).
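To generalize the snowman example, you can compute the UTF-8 bytes for an arbitrary code point and feed them to %c one byte at a time. This is only a sketch of the standard UTF-8 bit layout (the function name is made up); in practice mb_chr() or IntlChar::chr() can do the same job:
// Sketch: split a code point into UTF-8 bytes, then emit each byte with %c.
function codepointToUtf8Bytes(int $cp): array {
    if ($cp < 0x80)    return [$cp];
    if ($cp < 0x800)   return [0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)];
    if ($cp < 0x10000) return [0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F)];
    return [0xF0 | ($cp >> 18), 0x80 | (($cp >> 12) & 0x3F), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F)];
}
$bytes = codepointToUtf8Bytes(0x2603);            // U+2603 SNOWMAN, 9731 decimal
vprintf(str_repeat('%c', count($bytes)), $bytes); // ☃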
How do I do something as simple as (in PHP) this code in C:
char buffer[11] = "testing";
FILE* file2 = fopen("data2.bin", "wb");
fwrite(buffer, sizeof buffer, 1, file2);
fclose(file2);
Whenever I try to write a binary file in PHP, it doesn't write in real binary.
Example:
$ptr = fopen("data2.bin", 'wb');
fwrite($ptr, "testing");
fclose($ptr);
I found on the internet that I need to use pack() to do this...
What I expected:
testing\9C\00\00
or
7465 7374 696e 679c 0100 00
What I got:
testing412
Thanks
You're making the classic mistake of confusing data with the representation of that data.
Let's say you have a text file. If you open it in Notepad, you'll see the following:
hello
world
This is because Notepad assumes the data is ASCII text. So it takes every byte of raw data, interprets it as an ASCII character, and renders that text to your screen.
Now if you go and open that file with a hex editor, you'll see something entirely different [1]:
68 65 6c 6c 6f 0d 0a 77 6f 72 6c 64 hello..world
That is because the hex editor instead takes every byte of the raw data, and displays it as a two-character hexadecimal number.
[1] Assuming Windows \r\n line endings and ASCII encoding.
So if you're expecting hexadecimal ASCII output, you need to convert your string to its hexadecimal encoding before writing it (as ASCII text!) to the file.
In PHP, what you're looking for is the bin2hex function which "Returns an ASCII string containing the hexadecimal representation of str." For example:
$str = "Hello world!";
echo bin2hex($str); // output: 48656c6c6f20776f726c6421
Note that the "wb" mode argument doesn't cause any special behavior. It guarantees binary output, not hexadecimal output. I cannot stress enough that there is a difference. The only thing the b really does, is guarantee that line endings will not be converted by the library when reading/writing data.
I have data stored in an SQLite database as BINARY(16), the value of which is determined by PHP's hex2bin function on a 32-character hexadecimal string.
As an example, the string 434e405b823445c09cb6c359fb1b7918 returns CN#[4EÀ¶ÃYûy.
The data stored in this database needs to be manipulated by JavaScript, and to do so I've used the following function (adapted from Andris's answer here):
// Convert hexadecimal to binary string
String.prototype.hex2bin = function ()
{
// Define the variables
var i = 0, l = this.length - 1, bytes = []
// Iterate over the nibbles and convert to binary string
for (i; i < l; i += 2)
{
bytes.push(parseInt(this.substr(i, 2), 16))
}
// Return the binary string
return String.fromCharCode.apply(String, bytes)
}
This works as expected, returning CN#[4EÀ¶ÃYûy from 434e405b823445c09cb6c359fb1b7918.
The problem I have, however, is that when dealing directly with the data returned by PHP's hex2bin function I am given the string CN#[�4E����Y�y rather than CN#[4EÀ¶ÃYûy. This is making it impossible for me to work between the two (for context, JavaScript is being used to power an offline iPad app that works with data retrieved from a PHP web app) as I need to be able to use JavaScript to generate a 32-character hexadecimal string, convert it to a binary string, and have it work with PHP's hex2bin function (and SQLite's HEX function).
This issue, I believe, is that JavaScript uses UTF-16 whereas the binary string is stored as utf8_unicode_ci. My initial thought, then, was that I need to convert the string to UTF-8. Using a Google search led me to here and searching StackOverflow led me to bobince's answer here, both of which recommend using unescape(encodeURIComponent(str)). However, this does not return what I need (CN#[�4E����Y�y):
// CN#[Â4EöÃYûy
unescape(encodeURIComponent('434e405b823445c09cb6c359fb1b7918'.hex2bin()))
My question, then, is:
How can I use JavaScript to convert a hexadecimal string into a UTF-8 binary string?
Given a hex-encoded UTF-8 string, `hex`,
hex.replace(/../g, '%$&')
will produce a URI-encoded UTF-8 string.
decodeURIComponent converts URI-encoded UTF-8 sequences into JavaScript UTF-16 encoded strings, so
decodeURIComponent(hex.replace(/../g, '%$&'))
should decode a properly hex-encoded UTF-8 string.
You can see that it works by applying it to the example from the hex2bin documentation.
alert(decodeURIComponent('6578616d706c65206865782064617461'.replace(/../g, '%$&')));
// alerts "example hex data"
The string you gave is not UTF-8 encoded though. Specifically,
434e405b823445c09cb6c359fb1b7918
        ^
82 must follow a byte with at least the first two bits set, and 5b is not such a byte.
RFC 2279 explains:
The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the UCS-4
character value.
UCS-4 range (hex.) UTF-8 octet sequence (binary)
0000 0000-0000 007F 0xxxxxxx
0000 0080-0000 07FF 110xxxxx 10xxxxxx
0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
Your application doesn't have to handle binary at any point. Insertion is the latest possible point, and that's where you finally convert to binary; selection is the earliest possible point, and that's where you convert back to hex. Use hex strings throughout the application.
When inserting, you can replace UNHEX with blob literals:
INSERT INTO table (id)
VALUES (X'434e405b823445c09cb6c359fb1b7918')
When selecting, you can use HEX():
SELECT HEX(id) FROM table
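A hypothetical PDO/SQLite sketch of that approach; the table and column names are made up, and $hex should be validated (e.g. with ctype_xdigit()) before being interpolated into the blob literal:
$db  = new PDO('sqlite:app.db');
$hex = '434e405b823445c09cb6c359fb1b7918';
// Insert: convert to binary at the last possible moment via a blob literal
$db->exec("INSERT INTO items (id) VALUES (X'$hex')");
// Select: convert back to hex at the earliest possible moment
$row = $db->query('SELECT HEX(id) AS id FROM items')->fetch(PDO::FETCH_ASSOC);
echo $row['id']; // 434E405B823445C09CB6C359FB1B7918 (SQLite's HEX() is uppercase)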
Expanding on Mike's answer, here's some code for encoding and decoding.
Note that the escape/unescape() functions are deprecated. If you need polyfills for them, you can check out the more comprehensive UTF-8 encoding example found here: http://jsfiddle.net/47zwb41o
// UTF-8 to hex
var utf8ToHex = function( s ){
s = unescape( encodeURIComponent( s ) );
var chr, i = 0, l = s.length, out = '';
for( ; i < l; i++ ){
chr = s.charCodeAt( i ).toString( 16 );
out += ( chr.length % 2 == 0 ) ? chr : '0' + chr;
}
return out;
};
// Hex to UTF-8
var hexToUtf8 = function( s ){
return decodeURIComponent( s.replace( /../g, '%$&' ) );
};
From the answers to this question I tried to make my program safer by converting strings to hex and comparing those values instead of dangerously using strings straight from the user. I modified the code from that question to add a conversion:
function mssql_escape($data) {
if(is_numeric($data))
return $data;
$data = iconv("ISO-8859-1", "UTF-16", $data);
$unpacked = unpack('H*hex', $data);
return '0x' . $unpacked['hex'];
}
I do this because in my database I am using nvarchar instead of varchar. Now when I run 'hello world!' through it on the PHP side, it comes up with
0xfeff00680065006c006c006f00200077006f0072006c00640021
Then I run the following query:
declare @test nvarchar(100);
set @test = 'hello world!';
select CONVERT(VARBINARY(MAX), @test);
It results in:
0x680065006C006C006F00200077006F0072006C0064002100
Now you'll notice those numbers are ALMOST the same. Other than the trailing zeros, the only difference is feff00. Why is that there? I realize all I would have to do is shift, but I'd really like to know WHY it's there instead of just making an assumption. Can anybody explain to me why php decides to throw feff00 (yellow!) in the front of my hex?
Well, Andrew, I seem to answer a lot of your questions. This link explains:
So the people were forced to come up with the bizarre convention of
storing a FE FF at the beginning of every Unicode string; this is
called a Unicode Byte Order Mark and if you are swapping your high and
low bytes it will look like a FF FE and the person reading your string
will know that they have to swap every other byte. Phew. Not every
Unicode string in the wild has a byte order mark at the beginning.
And Wikipedia explains:
If the 16-bit units are represented in big-endian byte order, this BOM
character will appear in the sequence of bytes as 0xFE followed by
0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text
display that expects the text to be ISO-8859-1.
If the 16-bit units
use little-endian order, the sequence of bytes will have 0xFF followed
by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a
text display that expects the text to be ISO-8859-1.
So the code you displayed starts with FEFF, which means it's in big-endian notation. Use UTF-16LE for little endian, and SQL Server will understand it. Shifting off the first six hex digits will only coincidentally work as long as you're only using two-byte characters.
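As a sketch, the helper from the question only needs its iconv target changed to UTF-16LE (which emits no BOM and uses the byte order SQL Server shows in CONVERT(VARBINARY(MAX), ...)):
function mssql_escape($data) {
    if (is_numeric($data))
        return $data;
    $data = iconv('ISO-8859-1', 'UTF-16LE', $data); // little-endian, no BOM
    return '0x' . bin2hex($data);
}
echo mssql_escape('hello world!');
// 0x680065006c006c006f00200077006f0072006c0064002100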
Is there a native or inexpensive way to check for the length of a string in bytes in PHP?
See http://bytes.com/topic/php/answers/653733-binary-string-length
Relevant part:
"In PHP, like in C, the string ends with a zero-character, '\0', (char)
0, null-terminator, null-byte or whatever you like to call it."
No, that's not the case - PHP strings are stored with both the length and the
data, unlike C strings that just has one pointer and uses a terminator. They're
"binary-safe" - NUL doesn't terminate the string.
See the definition of zvalue_value in zend.h; the string part has both a "char
*val" and "int len".
Problems would start if you're using the mbstring.func_overload, which changes
how strlen() and the other functions work, and does try and treat strings as
strings of characters in a specific encoding rather than a string of bytes.
This is not the normal PHP behaviour.
The answer is that strlen should return the number of bytes regardless of the content of the string. For multi-byte character strings, you get the wrong number of characters, but the right number of bytes. However, you need to be certain you're not using the mbstring overload, which changes how strlen behaves.
In the event that you have the mbstring overload set, or you are developing for platforms where you are unsure about this setting, you can do the following:
$len=strlen(bin2hex($data))/2;
The reason this works is that in hex you are guaranteed to get 2 characters for every byte that comes out of bin2hex (it returns two characters even for a null byte).
Note that it will use significantly more resources than a plain strlen (after all, it hex-encodes the entire string), so you should definitely not do this to large amounts of data unless it's absolutely necessary.
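Alternatively, if the mbstring extension is available, asking for the '8bit' encoding counts raw bytes and is not affected by the func_overload setting:
// Byte length regardless of mbstring.func_overload
$len = mb_strlen($data, '8bit');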
On php.org, someone was nice enough to create this function. Just multiply by 8 and you've got however many bits were in that string, as the function returns bytes.
The length of a string (textual data) is determined by the position of the NULL character which marks the end.
In case of binary data, NULL can be and often is in the middle of data.
You don't check the length of binary data. You have to know it beforehand. In your case, the length is 16 (bytes, not bits, if it is UUID).
As far as UUID validity is concerned, any 16-byte value is a valid UUID, so you are out of luck there.