I have data stored in an SQLite database as BINARY(16), the value of which is determined by PHP's hex2bin function on a 32-character hexadecimal string.
As an example, the string 434e405b823445c09cb6c359fb1b7918 returns CN#[4EÀ¶ÃYûy.
The data stored in this database needs to be manipulated by JavaScript, and to do so I've used the following function (adapted from Andris's answer here):
// Convert a hexadecimal string to a binary string
String.prototype.hex2bin = function () {
    // Define the variables
    var i = 0, l = this.length - 1, bytes = [];
    // Iterate over the nibble pairs and convert each byte to a character code
    for (; i < l; i += 2) {
        bytes.push(parseInt(this.substr(i, 2), 16));
    }
    // Return the binary string
    return String.fromCharCode.apply(String, bytes);
};
This works as expected, returning CN#[4EÀ¶ÃYûy from 434e405b823445c09cb6c359fb1b7918.
The problem I have, however, is that when dealing directly with the data returned by PHP's hex2bin function I am given the string CN#[�4E����Y�y rather than CN#[4EÀ¶ÃYûy. This is making it impossible for me to work between the two (for context, JavaScript is being used to power an offline iPad app that works with data retrieved from a PHP web app) as I need to be able to use JavaScript to generate a 32-character hexadecimal string, convert it to a binary string, and have it work with PHP's hex2bin function (and SQLite's HEX function).
The issue, I believe, is that JavaScript uses UTF-16, whereas the binary string is stored as utf8_unicode_ci. My initial thought, then, was that I need to convert the string to UTF-8. A Google search led me here and searching StackOverflow led me to bobince's answer here, both of which recommend using unescape(encodeURIComponent(str)). However, this does not return what I need (CN#[�4E����Y�y):
// CN#[Â4EöÃYûy
unescape(encodeURIComponent('434e405b823445c09cb6c359fb1b7918'.hex2bin()))
My question, then, is:
How can I use JavaScript to convert a hexadecimal string into a UTF-8 binary string?
Given a hex-encoded UTF-8 string `hex`,
hex.replace(/../g, '%$&')
will produce a URI-encoded UTF-8 string.
decodeURIComponent converts URI-encoded UTF-8 sequences into JavaScript UTF-16 encoded strings, so
decodeURIComponent(hex.replace(/../g, '%$&'))
should decode a properly hex-encoded UTF-8 string.
You can see that it works by applying it to the example from the hex2bin documentation.
alert(decodeURIComponent('6578616d706c65206865782064617461'.replace(/../g, '%$&')));
// alerts "example hex data"
The string you gave is not UTF-8 encoded though. Specifically,
434e405b823445c09cb6c359fb1b7918
^
82 is a continuation byte, so it must follow a byte with at least the first two bits set, and 5b is not such a byte.
RFC 2279 explains:
The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the UCS-4
character value.
UCS-4 range (hex.) UTF-8 octet sequence (binary)
0000 0000-0000 007F 0xxxxxxx
0000 0080-0000 07FF 110xxxxx 10xxxxxx
0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
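You can confirm this behaviour directly: decodeURIComponent rejects the percent-encoded form of the string from the question with a URIError, precisely because it is not a valid UTF-8 byte sequence. A small sketch:

```javascript
// A properly hex-encoded UTF-8 string decodes cleanly:
var ok = decodeURIComponent('6578616d706c65'.replace(/../g, '%$&'));
console.log(ok); // "example"

// The string from the question is not valid UTF-8, so decoding throws
// URIError ("URI malformed"): the continuation byte 82 has no lead byte.
var threw = false;
try {
    decodeURIComponent('434e405b823445c09cb6c359fb1b7918'.replace(/../g, '%$&'));
} catch (e) {
    threw = e instanceof URIError;
}
console.log(threw); // true
```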
Your application doesn't have to handle binary at any point. Convert to binary at the last possible moment (insertion), convert back to hex at the earliest possible moment (selection), and use hex strings throughout the application.
When inserting, you can replace UNHEX with blob literals:
INSERT INTO table (id)
VALUES (X'434e405b823445c09cb6c359fb1b7918')
When selecting, you can use HEX:
SELECT HEX(id) FROM table
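With this approach the application only ever sees hex strings, so building the statement needs nothing more than interpolation. A minimal sketch (the helper name and table/column are illustrative; validate the hex string before using it in real code):

```javascript
// Hypothetical helper: keep the id as a hex string in the application and
// let SQLite do the hex-to-binary conversion via a blob literal.
function insertSql(hexId) {
    // In real code, verify hexId matches /^[0-9a-fA-F]+$/ first.
    return "INSERT INTO table (id) VALUES (X'" + hexId + "')";
}

console.log(insertSql('434e405b823445c09cb6c359fb1b7918'));
// INSERT INTO table (id) VALUES (X'434e405b823445c09cb6c359fb1b7918')
```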
Expanding on Mike's answer, here's some code for encoding and decoding.
Note that the escape/unescape() functions are deprecated. If you need polyfills for them, you can check out the more comprehensive UTF-8 encoding example found here: http://jsfiddle.net/47zwb41o
// UTF-8 to hex
var utf8ToHex = function (s) {
    s = unescape(encodeURIComponent(s));
    var chr, i = 0, l = s.length, out = '';
    for (; i < l; i++) {
        chr = s.charCodeAt(i).toString(16);
        out += (chr.length % 2 === 0) ? chr : '0' + chr;
    }
    return out;
};

// Hex to UTF-8
var hexToUtf8 = function (s) {
    return decodeURIComponent(s.replace(/../g, '%$&'));
};
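A quick round trip shows the pair at work (the functions are restated here so the snippet is self-contained):

```javascript
var utf8ToHex = function (s) {
    s = unescape(encodeURIComponent(s));
    var chr, i = 0, l = s.length, out = '';
    for (; i < l; i++) {
        chr = s.charCodeAt(i).toString(16);
        out += (chr.length % 2 === 0) ? chr : '0' + chr;
    }
    return out;
};
var hexToUtf8 = function (s) {
    return decodeURIComponent(s.replace(/../g, '%$&'));
};

var hex = utf8ToHex('héllo');  // 'é' becomes the two UTF-8 bytes c3 a9
console.log(hex);              // "68c3a96c6c6f"
console.log(hexToUtf8(hex));   // "héllo"
```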
Related
I have an array of hex code in this format
FF8FFE00+++++
The above example is just one string, with + representing the rest of the hex code, which is over 60k characters long (no use in hogging your browser). The format looks exactly like that. So let's say
$a = 'FFD8FF++++ string';
echo base64_encode($a);
When printing the above, PHP takes the hex code as a plain string and generates the base64 of that string instead of the decoded bytes.
I've looked all over, but the conversions I found just end up encoding the hex code as a string as well.
In Notepad++ I converted the string to ASCII, then encoded it in base64, and the result was the expected one (base64 that I can use for an image).
Any idea how I can tell PHP that the string is actually hex code, not a plain string?
With hex2bin you can transform a string containing hexadecimals into binary data (stored as a string in PHP). After that encode the binary string to base64 format.
$hex = 'FFD8FF'; // and much more hex data as a string, as in your example
$bin = hex2bin($hex); // convert the hex values to binary data stored as a PHP string
$b64 = base64_encode($bin); // the base64 representation of the binary data
var_dump($b64); // string(4) "/9j/"
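For reference (not part of the original answer), the same conversion in Node.js is a one-liner with Buffer:

```javascript
// Hex string -> raw bytes -> base64, using Node's Buffer
var b64 = Buffer.from('FFD8FF', 'hex').toString('base64');
console.log(b64); // "/9j/"
```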
So I was searching for a proper way in PHP to detect whether a string is in the BMP range (Basic Multilingual Plane), but I found nothing. Even mb_check_encoding and mb_detect_encoding do not offer any help in this particular case.
So I wrote my own code
<?php
function is_bmp($string) {
    $str_ar = mb_str_split($string);
    foreach ($str_ar as $char) {
        /* Check if any character's code point lies outside the BMP range */
        if (mb_ord($char) > 0xFFFF)
            return false;
    }
    return true;
}

/* String containing non-BMP Unicode characters */
$string = '😈blah blah';
var_dump(is_bmp($string));
?>
Outputs:
bool(false)
Now my question is:
Is there a better approach? and are there any flaws in it?
If you have a correctly UTF-8 encoded input string, you can simply inspect its bytes to determine whether it contains symbols outside the BMP.
Literally, you need to detect whether the input string contains any symbol whose code point is greater than 0xFFFF (i.e., one that needs more than 16 bits).
Note on how UTF-8 encoding works:
Code points 0 through 0x7F are encoded as-is, in one byte.
All other code points start with a lead byte in the range 0xC0...0xFF, which also encodes how many additional bytes follow; the additional (continuation) bytes are in the range 0x80...0xBF.
To encode code points 0x10000 and greater, UTF-8 requires a sequence of 4 bytes, and the first byte of that sequence will be 0xF0 or greater. In all other cases the whole string will contain only bytes less than 0xF0.
In short, your task is just to find whether the binary representation of the string contains any byte in the range 0xF0...0xFF.
function is_bmp($string) {
    return preg_match('#[\xF0-\xFF]#', $string) == 0;
}
OR
even simpler (though probably slower), you can use PCRE's ability to work with UTF-8 sequences (see the PCRE_UTF8 option):
function is_bmp($string) {
    return preg_match('#[^\x00-\x{FFFF}]#u', $string) == 0;
}
var_dump(
!preg_match('/[^\x0-\x{ffff}]/u', '😈blah blah')
);
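For comparison (not part of the original answer), the same check is straightforward in JavaScript, where for...of iterates by code point rather than by UTF-16 unit:

```javascript
// Return true if every character of the string lies within the BMP
// (code point <= 0xFFFF).
function isBmp(str) {
    for (const ch of str) {               // iterates by code point, not UTF-16 unit
        if (ch.codePointAt(0) > 0xFFFF) return false;
    }
    return true;
}

console.log(isBmp('😈blah blah')); // false
console.log(isBmp('blah blah'));   // true
```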
From the answers to this question, I tried to make my program safer by converting strings to hex and comparing those values instead of dangerously using strings straight from the user. I modified the code from that question to add a conversion:
function mssql_escape($data) {
if(is_numeric($data))
return $data;
$data = iconv("ISO-8859-1", "UTF-16", $data);
$unpacked = unpack('H*hex', $data);
return '0x' . $unpacked['hex'];
}
I do this because in my database I am using nvarchar instead of varchar. Now when I run through it on the php side, it comes up with
0xfeff00680065006c006c006f00200077006f0072006c00640021
Then I run the following query:
declare @test nvarchar(100);
set @test = 'hello world!';
select CONVERT(VARBINARY(MAX), @test);
It results in:
0x680065006C006C006F00200077006F0072006C0064002100
Now you'll notice those numbers are ALMOST the same. Other than the trailing zeros, the only difference is feff00. Why is that there? I realize all I would have to do is shift, but I'd really like to know WHY it's there instead of just making an assumption. Can anybody explain to me why php decides to throw feff00 (yellow!) in the front of my hex?
Well, Andrew, I seem to answer a lot of your questions. This link explains:
So the people were forced to come up with the bizarre convention of
storing a FE FF at the beginning of every Unicode string; this is
called a Unicode Byte Order Mark and if you are swapping your high and
low bytes it will look like a FF FE and the person reading your string
will know that they have to swap every other byte. Phew. Not every
Unicode string in the wild has a byte order mark at the beginning.
And Wikipedia explains:
If the 16-bit units are represented in big-endian byte order, this BOM
character will appear in the sequence of bytes as 0xFE followed by
0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text
display that expects the text to be ISO-8859-1.
If the 16-bit units
use little-endian order, the sequence of bytes will have 0xFF followed
by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a
text display that expects the text to be ISO-8859-1.
So the code you displayed starts with FEFF, which means it's in big-endian notation. Use UTF-16LE for little-endian, and SQL Server will understand it. Stripping the first SIX hex digits will only coincidentally work as long as every character fits in two bytes.
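To see why, you can reproduce SQL Server's little-endian, BOM-free byte layout by hand. A JavaScript sketch for illustration (it assumes every character is in the BMP, i.e. one UTF-16 code unit each):

```javascript
// Encode a string as UTF-16LE and return it as hex: low byte first,
// then high byte, with no byte order mark.
function utf16leHex(s) {
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        out += (code & 0xFF).toString(16).padStart(2, '0');        // low byte
        out += ((code >> 8) & 0xFF).toString(16).padStart(2, '0'); // high byte
    }
    return out;
}

console.log(utf16leHex('hello world!'));
// "680065006c006c006f00200077006f0072006c0064002100"
```

This matches the VARBINARY output above (case aside): no FEFF BOM, and each ASCII character becomes its byte followed by 00.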
I have to read a binary file written by a C++ app using Qt framework. Data is structured from a C struct as described below. Chars are written from a QString pointer in the C++ app.
struct panelSystem {
    char ipAddress[16];
    char netMask[16];
    char gateway[16];
    char paddingBytes[128];
};
I tried using the following PHP code to read the multibyte char values:
// Where: $length is a defined number (16 or 128 in this case)
//        $data is a binary string read from the binary file
$var = substr($data, $currentOffset, $length);
$currentOffset += $length; // Increment the offset by $length bytes
$var = trim(str_replace("\0", "\n", $var));
$var = unpack("C*", $var);
$char = '';
foreach ($var as $letter) {
    $char .= chr($letter);
}
$var = $char;
Unfortunately the result includes null (\0) and/or irrelevant characters before and after the desired char.
Is there a way to interpret or convert those char from QString multibyte array to PHP standard string (without modifying the original input) ?
Thank you.
A QString will be written to the binary file with (at least) a length value and the character data. You also need to take into account how the string is formatted. It might be UTF-16, in which case simple chars will have each character padded with zeros. Where the zeros go depends on the endianness the file is written in, though IIRC Qt always stores files with one particular endianness, regardless of platform, so that the binaries are cross-platform.
Have a look in the Qt source code to see how strings are written. Maybe also take a look at the binary file in a hex editor.
If your binary file only ever contains the struct, and you don't have a problem with the binary format being fixed, then you may find things far easier to write the file using raw IO:
struct panelSystem {
    char ipAddress[16];
    char netMask[16];
    char gateway[16];
    char paddingBytes[128];

    void write(QDataStream& ds) {
        ds.writeRawData(ipAddress, sizeof(ipAddress));
        ds.writeRawData(netMask, sizeof(netMask));
        ds.writeRawData(gateway, sizeof(gateway));
        ds.writeRawData(paddingBytes, sizeof(paddingBytes));
    }
};

//...

panelSystem p;
GetPanelSystemData(&p);

QFile file("c:\\work\\test\\testbin.bin");
if (file.open(QIODevice::Truncate | QIODevice::WriteOnly)) {
    QDataStream ds(&file);
    p.write(ds);
}
file.close();
I would recommend at least adding a version number/header to the start of the binary file to prevent painting yourself into a corner.
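If the file is written with writeRawData as above, each field is a fixed-size, NUL-padded byte run, and reading it back reduces to slicing at fixed offsets and trimming at the first NUL. A sketch of such a reader, in Node.js for illustration (the function name is hypothetical; it does roughly what the PHP code in the question attempts):

```javascript
// Hypothetical reader for the 176-byte panelSystem record.
const FIELDS = [['ipAddress', 16], ['netMask', 16], ['gateway', 16], ['paddingBytes', 128]];

function parsePanelSystem(buf) {
    const out = {};
    let offset = 0;
    for (const [name, len] of FIELDS) {
        const raw = buf.slice(offset, offset + len);
        const nul = raw.indexOf(0);  // first NUL terminator, if any
        out[name] = raw.slice(0, nul === -1 ? len : nul).toString('latin1');
        offset += len;
    }
    return out;
}
```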
In PHP, what does it mean for a function to be binary-safe?
What makes these functions special, and where are they typically used?
It means the function will work correctly when you pass it arbitrary binary data (i.e. strings containing non-ASCII bytes and/or null bytes).
For example, a non-binary-safe function might be based on a C function which expects null-terminated strings, so if the string contains a null character, the function would ignore anything after it.
This is relevant because PHP does not cleanly separate string and binary data.
The other users already mentioned what binary safe means in general.
In PHP, the meaning is more specific, referring only to what Michael gives as an example.
All strings in PHP have an associated length, which is the number of bytes that compose the string. When a function manipulates a string, it can either:
1. Rely on that length meta-data.
2. Rely on the string being null-terminated, i.e., assume that after the data that is actually part of the string, a byte with value 0 will appear.
It's also true that all PHP string variables manipulated by the engine are null-terminated. The problem with functions that rely on 2. is that if the string itself contains a byte with value 0, the function manipulating it will think the string ends at that point and will ignore everything after it.
For instance, if PHP's strlen function worked like C standard library strlen, the result here would be wrong:
$str = "abc\x00abc";
echo strlen($str); //gives 7, not 3!
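JavaScript strings, for contrast, are also length-counted rather than NUL-terminated, so the equivalent check is inherently binary safe:

```javascript
// Embedded NUL bytes are counted like any other character
var str = 'abc\x00abc';
console.log(str.length);          // 7
console.log(str.indexOf('\x00')); // 3
```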
More examples:
<?php
$string1 = "Hello";
$string2 = "Hello\x00World";
// This function is NOT ! binary safe
echo strcoll($string1, $string2); // gives 0, strings are equal.
// This function is binary safe
echo strcmp($string1, $string2); // gives <0, $string1 is less than $string2.
?>
\x indicates hexadecimal notation. See: PHP strings
0x00 = NULL
0x04 = EOT (End of transmission)
See an ASCII table for the full character list.