Why are there binary safe AND binary unsafe functions in php?

Is there any reason for this behavior/implementation? Example:
$array = array("index_of_an_array" => "value");
class Foo {
private $index_of_an_array;
function __construct() {}
}
$foo = new Foo();
$array = (array)$foo;
$key = str_replace("Foo", "", array_keys($array)[0]);
echo $array[$key];
This gives a notice in which the index name appears to be empty:
NOTICE Undefined index: on line number 9
Example #2:
echo date("Y\0/m/d");
Outputs:
2016
BUT! echo, var_dump() and some other functions output the string "as it is"; the \0 bytes are merely hidden by browsers.
$string = "index-of\0-an-array";
$string2 = "Y\0/m/d";
echo $string;
echo $string2;
var_dump($string);
var_dump($string2);
Outputs:
index-of-an-array
"Y/m/d"
string(18) "index-of-an-array"
string(6) "Y/m/d"
Notice that the length of $string is 18, but only 17 characters are shown.
EDIT
From possible duplicate and php manual:
The key can either be an integer or a string. The value can be of any type.
Strings containing valid integers will be cast to the integer type. E.g. the key "8" will actually be stored under 8. On the other hand "08" will not be cast, as it isn't a valid decimal integer. So in short, any string can be a key. And a string can contain any binary data (up to 2GB). Therefore, a key can be any binary data (since a string can be any binary data).
From php string details:
There are no limitations on the values the string can be composed of;
in particular, bytes with value 0 (“NUL bytes”) are allowed anywhere
in the string (however, a few functions, said in this manual not to be
“binary safe”, may hand off the strings to libraries that ignore data
after a NUL byte.)
But I still do not understand why the language is designed this way. Are there reasons for this behavior/implementation? Why doesn't PHP handle input as binary safe everywhere, rather than just in some functions?
From comment:
The reason is simply that many PHP functions like printf use the C library's implementation behind the scenes, because the PHP developers were lazy.
Aren't echo, var_dump and print_r among those? In other words, functions that output something. They are in fact binary safe, if we look at my first example. It makes no sense to me to implement both binary-safe and binary-unsafe functions for output, or to use some functions as they are in the C standard library and write others completely new.

The short answer to "why" is simply history.
PHP was originally written as a way to script C functions so they could be called easily while generating HTML. PHP strings were therefore just C strings: sequences of bytes terminated by a NUL byte. In modern PHP terms we would say nothing was binary-safe, simply because it wasn't planned to be anything else.
Early PHP was not intended to be a new programming language, and grew organically, with Lerdorf noting in retrospect: "I don’t know how to stop it, there was never any intent to write a programming language […] I have absolutely no idea how to write a programming language, I just kept adding the next logical step on the way."
Over time the language grew to support more elaborate string-processing functions, many taking the string's specific bytes into account and becoming "binary-safe". According to the recently written formal PHP specification:
As to how the bytes in a string translate into characters is unspecified. Although a user of a string might choose to ascribe special semantics to bytes having the value \0, from PHP's perspective, such null bytes have no special meaning. PHP does not assume strings contain any specific data or assign special values to any bytes or sequences.
As a language that has grown organically, there hasn't been a move to universally treat strings in a manner different from C. Therefore functions and libraries are binary-safe on a case-by-case basis.

First Example from Question
Your first example is confusing because it is the error message that gets cut off at the null character, not the array that handles the string incorrectly. The original code you posted, with the error message, follows:
$array = array("index-of-an-array" => "value");
$string = "index-of\0-an-array";
echo $array[$string];
Notice: Undefined index: index-of in
Note that the error message above has been truncated to index-of because of the null character. The array is working as expected; if you try it this way, it works just fine:
$array = array("index-of\0-an-array" => "value");
$string = "index-of\0-an-array";
echo $array[$string];
The error message correctly identified that the two keys are different, which
they are:
"index-of\0-an-array" != "index-of-an-array"
The problem is that the error message printed out everything up to the null character. If that's the case then it might be considered a bug by some.
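A minimal sketch of the point above, assuming nothing beyond core PHP (the variable names are made up for illustration): array keys and string comparison treat an embedded NUL byte as ordinary data, so the two keys really are different.

```php
<?php
// Two keys that differ only by an embedded NUL byte are distinct array keys.
$withNul    = "index-of\0-an-array";
$withoutNul = "index-of-an-array";

$array = [
    $withNul    => "with NUL",
    $withoutNul => "without NUL",
];

var_dump(strlen($withNul));                 // int(18): the NUL byte is counted
var_dump(strlen($withoutNul));              // int(17)
var_dump($array[$withNul]);                 // string(8) "with NUL"
var_dump($withNul === $withoutNul);         // bool(false)
var_dump(bin2hex(substr($withNul, 7, 3)));  // string(6) "66002d": 'f', NUL, '-'
```

Dumping the bytes with bin2hex() is a convenient way to see a NUL that a browser or terminal would otherwise hide.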
The second example is starting to plumb the depths of PHP :)
I've added some code to it so we can see what's happening:
<?php
class Foo {
public $index_public;
protected $index_prot;
private $index_priv;
function __construct() {
$this->index_public = 0;
$this->index_prot = 1;
$this->index_priv = 2;
}
}
$foo = new Foo();
$array = (array)$foo;
print_r($foo);
print_r($array);
//echo $array["\0Foo\0index_of_an_array2"];//This prints 2
//echo $foo->{"\0Foo\0index_of_an_array2"};//This fails
var_dump($array);
echo array_keys($array)[0] . "\n";
echo $array["\0Foo\0index_priv"] . "\n";
echo $array["\0*\0index_prot"] . "\n";
The above code's output is:
Foo Object
(
[index_public] => 0
[index_prot:protected] => 1
[index_priv:Foo:private] => 2
)
Array
(
[index_public] => 0
[*index_prot] => 1
[Fooindex_priv] => 2
)
array(3) {
'index_public' =>
int(0)
'\0*\0index_prot' =>
int(1)
'\0Foo\0index_priv' =>
int(2)
}
index_public
2
1
The PHP developers chose to use the \0 character to separate the parts of a mangled member name. Note that protected fields use a * to indicate that the member variable may actually belong to many classes. The mangling also protects private access, i.e. this code would not work:
echo $foo->{"\0Foo\0index_priv"}; //This fails
but once you cast it to an array there is no such protection, i.e. this works:
echo $array["\0Foo\0index_priv"]; //This prints 2
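As a rough illustration of the mangling described above, the sketch below splits the keys produced by an (array) cast back into a visibility and a plain property name. The helper describe_key() is hypothetical, not part of PHP, and it only relies on the "\0scope\0name" layout shown in the var_dump output.

```php
<?php
class Foo {
    public $index_public = 0;
    protected $index_prot = 1;
    private $index_priv = 2;
}

// Hypothetical helper: turn a (possibly mangled) key into [visibility, name].
function describe_key(string $key): array {
    if ($key === '' || $key[0] !== "\0") {
        return ['public', $key];           // public properties are not mangled
    }
    // Mangled form: "\0" . scope . "\0" . name, where scope is "*" or a class name.
    [, $scope, $name] = explode("\0", $key, 3);
    return [$scope === '*' ? 'protected' : "private($scope)", $name];
}

$result = [];
foreach ((array) new Foo() as $key => $value) {
    [$visibility, $name] = describe_key((string) $key);
    $result[$name] = $visibility;
    echo "$name: $visibility = $value\n";
}
```

This is only meant for inspection and debugging; the Reflection API is the supported way to enumerate properties and their visibility.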
Is there any reason for this behavior/implementation?
Yes. On any system that you need to interface with, you need to make system
calls. If you want the current time, or to convert a date, you need to talk
to the operating system, and this means calling the OS API. In the case of Linux
this API is in C.
PHP was originally developed as a thin wrapper around C. Quite a few languages
start out this way and evolve; PHP is no exception.
Is there any reason for this behavior/implementation?
In the absence of any backwards compatibility issues I'd say some of the choices are less than optimal but my suspicion is that backwards compatibility is a large factor.
But I still do not understand why the language is designed this way?
Backwards compatibility is almost always the reason why features that people don't like remain in a language. Over time languages evolve and remove things, but it's incremental and prioritized. If you had asked all the PHP developers whether they wanted better binary string handling for some functions or better performance, I think performance would win, and PHP 7 did bring large speedups (a JIT compiler followed in PHP 8.0). Note that the people doing the actual work ultimately decide what they work on, and working on a JIT compiler is more fun than fixing libraries that do things in seemingly odd ways.
I'm not aware of any language implementor who doesn't wish they'd done some things differently from the outset. Anyone implementing a compiler before a
language is popular is under a lot of pressure to get something that works for
them, and that means cutting corners. Not all languages in existence today had a
huge company backing them; most often it was a small dedicated team and they
made mistakes. Some were lucky enough to get paid to do it. Calling them lazy
is a bit unfair.
All languages have dark corners, warts and boils, and features you'll eventually hate. Some more than others, and PHP has a bad rep because it has/had a lot more than most. Note that PHP 5 is a vast leap forward from PHP 4. I'd imagine that PHP 7 will improve things even more.
Anyone who thinks their favorite language is free from problems is delusional and has almost certainly not plumbed the depths of the tool they're using.

Functions in PHP which internally operate on C strings are "not binary safe" in PHP terminology. A C string is an array of bytes ending with byte 0. When a PHP function internally uses C strings, it reads the characters one by one, and when it encounters byte 0 it treats it as the end of the string. Byte 0 tells C string functions where the string ends, since a C string does not carry any information about its length.
"Not binary safe" means that such a function stops at the first byte 0, so any data after an embedded byte 0 is silently ignored. In addition, if a function which operates on C strings is somehow handed a C string not terminated with byte 0, the behavior is unpredictable, because the function will read or write bytes beyond the end of the string, adding garbage to the string and/or potentially crashing PHP.
In C++, for example, we have the string object. This object also contains an array of characters, but it also has a length field which it updates on any length change. So it does not require byte 0 to tell it where the end is. This is why a string object can contain any number of 0 bytes, although this is generally not valid since it should contain only valid characters.
For this to be corrected, the whole PHP core, including any modules which operate on C strings, would need to be rewritten in order to consign "non binary safe" functions to history. The amount of work needed for this is huge, and all the module authors would need to produce new code for their modules. This could introduce new bugs and instabilities.
The issue with byte 0 and "non binary safe" functions is not critical enough to justify rewriting PHP and PHP modules. Maybe in some newer PHP version, where some things need to be coded from scratch, it would make sense to correct this.
Until then, you just need to know which functions are not binary safe and avoid handing them strings that contain byte 0. Usually you will notice the problem when a string is unexpectedly cut short at the first byte 0.

Related

Hash Function that works identically on ColdFusion 10+ and PHP 7.x?

I am currently working on a new PHP site to replace a site that currently uses ColdFusion 10. When the new site is ready the ColdFusion site will be decommissioned and I won't have access to anything related to ColdFusion. I don't want to have to reset all the previous passwords, so I need to be able to reproduce in PHP the one-way SHA-512 hash that is used in ColdFusion.
This question on Stack Overflow is very relevant to this problem:
hash function that works identically on ColdFusion MX7 and PHP 5.x?
The difference is they manually looped in 1024 iterations in ColdFusion. The ColdFusion site I am translating uses the built in iterations feature. I have tried what they did in the above question, plus a few variations including XOR in the end but ultimately I can't find documentation on what ColdFusion is doing during those iterations.
ColdFusion:
<cfset hpswd = Hash(FORM.npswd & salt, "SHA-512", "UTF-8", 1000) >
PHP (without iterations logic):
$hpswd = strtoupper(hash("sha512", $npswd.$salt));
Given this password: q7+Z6Wp#&#hQ
With this salt: F4DD573A-EC09-0A78-61B5DA6CBDB39F36
ColdFusion gives this Hash (with 1000 iterations): 1FA341B135918B61CB165AA67B33D024CC8243C679F20967A690C159D1A48FACFA4C57C33DDDE3D64539BF4211C44C8D1B18C787917CD779B2777856438E4D21
Even when making sure to strtoupper in PHP, I have not managed to duplicate the iterations step. So the question is: what operations is ColdFusion 10+ performing during the iterations step?

Regardless of language, a SHA-512 hashing function should return the same output given the same inputs. Here, it looks like you may just need to ensure that your inputs are the same. This includes the encoding of the text you are inputting. Then you'll hash over it the same total number of times.
As of today, the CFDocs documentation of ColdFusion hash() is incorrect, but I have submitted a correction for that. See my comments above about why I believe Adobe lists their defaults this way. A Default of 1 Iteration is correct for Lucee CFML, but not for Adobe CF. You are correct that the ACF default is 0. CF2018 clarifies this parameter.
Now, to your issue, your original code in ACF10 is:
<cfset hpswd = Hash(FORM.npswd & salt, "SHA-512", "UTF-8", 1000) >
This says that you are hashing with the SHA-512 algorithm, using UTF-8 encoding, and repeating an additional 1000 times. This means that your hash() function is actually being called 1001 times for your final output.
So:
<cfset npswd="q7+Z6Wp#&##hQ">
<cfset salt = "F4DD573A-EC09-0A78-61B5DA6CBDB39F36">
<cfset hpswd = Hash(npswd & salt, "SHA-512","UTF-8",1000) >
<cfoutput>#hpswd#</cfoutput>
Gives us 1FA341B135918B61CB165AA67B33D024CC8243C679F20967A690C159D1A48FACFA4C57C33DDDE3D64539BF4211C44C8D1B18C787917CD779B2777856438E4D21.
https://trycf.com/gist/7212b3ee118664c5a7f1fb744b30212d/acf?theme=monokai
One thing to note is that the ColdFusion hash() function returns a HEXADECIMAL string of the hashed input, but when it uses its iteration argument, it iterates over the BINARY output of the hashed value. This will make a big difference in the final output.
https://trycf.com/gist/c879e9e900e8fd0aa23e766bc308e072/acf?theme=monokai
To do this in PHP, we'd do something like this:
NOTE: I am not a PHP developer, so this is probably not the best way to do this. Don't judge me, please. :-)
<?php
mb_internal_encoding("UTF-8");
$npswd="q7+Z6Wp#&#hQ";
$salt = "F4DD573A-EC09-0A78-61B5DA6CBDB39F36";
$hpswd = $npswd.$salt ;
for($i=1; $i<=1001; $i++){
$hpswd = hash("SHA512",$hpswd,true); // raw_output=true argument >> raw binary data.
// > https://www.php.net/manual/en/function.hash.php
}
echo(strtoupper(bin2hex($hpswd)));
?>
The first thing I do is ensure that the encoding we are using is UTF-8. Then I iterate over the given input string 1+1000 times. Using the raw_output argument of PHP hash() gives us the binary representation in each loop, which will give us the same final output. Afterwards, we use bin2hex() to convert the final binary value to a hexadecimal value, and then strtoupper() to uppercase it, giving us 1FA341B135918B61CB165AA67B33D024CC8243C679F20967A690C159D1A48FACFA4C57C33DDDE3D64539BF4211C44C8D1B18C787917CD779B2777856438E4D21, matching the CF-hashed value.
Also note that CF returns an uppercase value whereas PHP is lowercase.
And final note: There are better methods for storing and using hashed passwords in PHP. This will help convert between CF and PHP hashes, but it would probably be better to ultimately convert all stored hashes into the PHP equivalents. https://www.php.net/manual/en/faq.passwords.php
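One possible way to act on that final note is sketched below: verify a login against the legacy ColdFusion-style hash, and on success re-hash the password with password_hash() so the stored value can be replaced. The function names legacy_cf_hash() and verify_and_upgrade() are hypothetical, and the storage layer is left out; only hash(), hash_equals(), bin2hex() and password_hash() are real PHP functions.

```php
<?php
// Legacy scheme as described above: SHA-512 over raw bytes, applied
// 1 + $additionalIterations times, then upper-case hex.
function legacy_cf_hash(string $password, string $salt, int $additionalIterations = 1000): string {
    $digest = $password . $salt;
    for ($i = 0; $i <= $additionalIterations; $i++) {
        $digest = hash('sha512', $digest, true); // true => raw binary output
    }
    return strtoupper(bin2hex($digest));
}

// Hypothetical login-time upgrade: returns a new password_hash() value to
// store if the legacy hash matches, or null if the password is wrong.
function verify_and_upgrade(string $password, string $salt, string $storedLegacyHash): ?string {
    if (!hash_equals($storedLegacyHash, legacy_cf_hash($password, $salt))) {
        return null;
    }
    return password_hash($password, PASSWORD_DEFAULT);
}
```

With this approach each account moves to the new format the first time its owner logs in, and later logins can use password_verify() alone.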
=============================================================
A point of clarification:
Both Adobe and Lucee changed the name of this parameter to clarify their intent, however they behave differently.
Lucee named the parameter numIterations with default 1. This is the total times that hash() will run.
In CF2018, with the introduction of Named Parameters, Adobe renamed the parameter additionalIterations from the original (and still documented) iterations. The original improper parameter name didn't matter prior to CF2018 because you couldn't use named params anyway. On their hash() documentation page, their verbiage is "Hence, this parameter is the number of iterations + 1. The default number of additional iterations is 0." (emphasis mine) The behavior has always (since CF10) matched this description, but there is clearly some confusion about its actual meaning, especially since there is a difference with Lucee's behavior and with Adobe's incorrect initial name of the parameter.
The parameter name iterations is incorrect and doesn't work with either Adobe CF 2018 or Lucee 4.5 or 5.x. And this is a function that is not currently compatible as-is between Lucee and Adobe ColdFusion.
The important thing to remember, especially if working with both Adobe and Lucee code, is that this function with the ???Iterations named param specified will produce two different outputs if the same-ish code is run. Adobe will run a hash() one additional time vs Lucee. The good news is that since the param names aren't the same, then if they are used, an error will be thrown instead of silently producing different hashes.
hash("Stack Overflow","md5","UTF-8",42) ;
// Lucee: C0F20A4219490E4BF9F03ED51A546F27
// Adobe: 42C57ECBF9FF2B4BEC61010B7807165A
hash(input="Stack Overflow", algorithm="MD5", encoding="UTF-8", numIterations=42) ;
// Lucee: C0F20A4219490E4BF9F03ED51A546F27
// Adobe: Error: Parameter validation error
hash(string="Stack Overflow", algorithm="MD5", encoding="UTF-8", additionalIterations=42) ;
// Lucee: Error: argument [ADDITIONALITERATIONS] is not allowed
// Adobe: 42C57ECBF9FF2B4BEC61010B7807165A
https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-h-im/hash.html
https://docs.lucee.org/reference/functions/hash.html

See https://cfdocs.org/hash "...in Adobe CF the value is the number of additional iterations."

What does (binary) casting actually do and why it should not be relied upon?

I'm using PHP 7.2.12
I came across the following statement in the Type Casting section of the PHP Manual:
(binary) casting and b prefix forward support was added in PHP 5.2.1.
Note that the (binary) cast is essentially the same as (string), but it
should not be relied upon.
I didn't fully understand the above text. Could someone please explain it to me?
I studied the following code examples given in PHP Manual on the same page :
<?php
$binary = (binary) $string;
var_dump($binary);
$binary = b"binary string";
var_dump($binary);
?>
Output :
Notice: Undefined variable: string in ..... on line 2
string(0) ""
string(13) "binary string"
If you look at the output above, I got the same strings even after casting to binary. So what conversion does binary casting actually do?
Why should the binary cast not be relied upon?
Also, on what types can the binary cast legally be performed?
Nowhere in the PHP manual is there any explanation or justification in this regard.
Could someone please point me in the right direction?

PHP had Big Plans™ for PHP 6, where strings would finally become Unicode strings. To illustrate what that means, the current PHP behaviour:
$str = '漢字';
echo $str[0];
// ?
In PHP 6, this would have output "漢" instead of a broken ?. In other words, strings are encoding and character aware, instead of dumb byte arrays. (To output "漢" in current PHP versions, you need something like mb_substr($str, 0, 1, 'UTF-8').)
To keep the old dumb-byte-array behaviour, you could prefix your strings with b'漢字' and you could cast Unicode strings to dumb byte arrays using (binary). This was all added to PHP 5 in preparation for PHP 6, so you could start updating your code in advance.
Well, except PHP 6 never happened, and b'' prefixes and (binary) casts still don't do anything to this date.
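A small sketch of the current behaviour described above, using only core functions (no mbstring). The string is written with \u{...} escapes so the example does not depend on how the file is saved; those escapes produce UTF-8 bytes.

```php
<?php
$str      = "\u{6F22}\u{5B57}";   // the two characters from the example, as UTF-8
$prefixed = b"\u{6F22}\u{5B57}";  // the b prefix is accepted but changes nothing

var_dump($str === $prefixed);            // bool(true)
var_dump(strlen($str));                  // int(6): strlen counts bytes, not characters
var_dump(bin2hex($str[0]));              // string(2) "e6": only the first byte of the first character
var_dump(bin2hex(substr($str, 0, 3)));   // string(6) "e6bca2": the full first character
```

Indexing with $str[0] returns a single byte, which is why echoing it shows a broken character.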

Dealing with binary data and mb_function overloading?

I have a piece of code here about which I need either assurance or a "no no no!", as to whether I'm thinking about this in the right or entirely wrong way.
This has to do with cutting a variable of binary data at a specific spot, and also dealing with multi-byte overloaded functions. For example substr is actually mb_substr and strlen is mb_strlen etc.
Our server is set to UTF-8 internal encoding, and so there's this weird little thing I do to circumvent it for this binary data manipulation:
// $binary_data is the incoming variable with binary
// $clip_size is generally 16, 32 or 64 etc
$curenc = mb_internal_encoding();// this should be "UTF-8"
mb_internal_encoding('ISO-8859-1');// change so mb_ overloading doesnt screw this up
if (strlen($binary_data) >= $clip_size) {
$first_hunk = substr($binary_data,0,$clip_size);
$rest_of_it = substr($binary_data,$clip_size);
} else {
// skip since its shorter than expected
}
mb_internal_encoding($curenc);// put this back now
I can't really show input and output results, since its binary data. But tests using the above appear to be working just fine and nothing is breaking...
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Notes:
The binary data coming in, is a concatenation of those two parts to begin with.
The first part's size is always known (but changes).
The second part's size is entirely unknown.
This is pretty darn close to encryption and stuffing the IV on front and ripping it off again (which oddly, I found some old code which does this same thing lol ugh).
So, I guess my question is:
Is this actually fine to be doing?
Or is there something super obvious I'm overlooking?

However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Your brain is right, you shouldn't be doing that in PHP in the first place. :)
Is this actually fine to be doing?
It depends on the purpose of your code.
I can't see any reason off the top of my head to cut binary data like that. So my first instinct would be "no no no!": use unpack() to properly parse the binary into usable variables.
That being said if you just need to split your binary because reasons, then I guess this is fine. As long as your tests confirm that the code is working for you, I can't see any problem.
As a side note, I don't use mbstring overloading exactly for this kind of use case - i.e. for whenever you need the default string functions.
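For what it's worth, here is a minimal sketch of the unpack() route mentioned above. The helper name split_binary() is made up for illustration. unpack() always works on raw bytes, so it does not depend on the mbstring internal encoding; note also that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, after which plain strlen() and substr() are byte-oriented again.

```php
<?php
// Hypothetical helper: split binary data into a fixed-size head and the rest.
// Returns null when the data is shorter than $clipSize.
function split_binary(string $data, int $clipSize): ?array {
    // "a<N>" reads N bytes verbatim (NUL bytes included); "a*" reads the remainder.
    // unpack() warns and returns false if there is not enough input, hence the @.
    $parts = @unpack("a{$clipSize}head/a*rest", $data);
    if ($parts === false) {
        return null;
    }
    return [$parts['head'], $parts['rest']];
}

$data = "\x00\x01\xFF\x10" . "payload\0with NUL";
[$firstHunk, $restOfIt] = split_binary($data, 4);
echo bin2hex($firstHunk), "\n"; // 0001ff10
```

The same idea covers the IV-plus-ciphertext case from the question: the head is the fixed-size prefix and the rest is whatever follows.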

MY SOLUTION TO THE WORRY
I dislike answering my own questions... but I wanted to share what I have decided on nonetheless.
Although what I had "worked", I still wanted to get rid of the hack of altering the charset encoding. It was old code, I admit, but for some reason I never looked at hex2bin and bin2hex for doing this. So I decided to change it to use those.
The resulting new code:
// $clip_size remains the same value for continuity later,
// only spot-adjusted here... which is why the *2.
$hex_data = bin2hex( $binary_data );
$first_hunk = hex2bin( substr($hex_data,0,($clip_size*2)) );
$rest_of_it = hex2bin( substr($hex_data,($clip_size*2)) );
if ( !empty($rest_of_it) ) { /* process the result for reasons */ }
Using the hex functions turns the mess into something mb will not screw with either way. A benchmark loop of 1 million iterations showed the process wasn't anything to be worried about (and it's safer to run in parallel to itself than the mb_encoding mangle method).
So I'm going with this. It sits better in my mind, and resolves my question for now... until I revisit this old code again in a few years and go "what was I thinking ?!".

PHP Hex to Integer

I'm integrating a PHP application with an API that uses permissions in the form of 0x00000008, 0x20000000, etc., where each of these represents a given permission. Their API returns an integer. In PHP, how do I interpret something like 0x20000000 as an integer so I can use bitwise operators?
Another dev told me he thought these numbers were in hex notation, but googling that for PHP I'm finding limited results. What kind of numbers are these, and how can I take an integer set of permissions and use bitwise operators to determine whether the user has 0x00000008?

As stated in the php documentation, much like what happens in the C language, this format is recognized in php as a native integer (in hexadecimal notation).
<?php
echo 0xDEADBEEF; // outputs 3735928559
Demo : https://3v4l.org/g1ggf
Ref : http://php.net/manual/en/language.types.integer.php#language.types.integer.syntax
Thus you can perform any bitwise operation on them, since they are plain integers, subject to the integer size of your platform (nowadays you are most likely on a 64-bit build; you can confirm this with phpinfo() or PHP_INT_SIZE if in doubt).
echo 0x00000001<<3; // 1*2*2*2, outputs 8
Demo : https://3v4l.org/AKv7H
Ref : http://php.net/manual/en/language.operators.bitwise.php
As suggested by @iainn and @SaltyPotato in their respective answer and comment, since you are reading the data from an API you might have to deal with these values arriving as strings instead of native integers. To handle this, you would just pass them through the hexdec function to convert them before use.
Ref : http://php.net/manual/en/function.hexdec.php

Since you are receiving it from an API, PHP will interpret it as a string:
echo hexdec('0x00000008');
This returns the integer 8.
This should work.
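Putting the two answers together, a permission check could look roughly like the sketch below. The constant and function names are invented for illustration; only the flag values 0x00000008 and 0x20000000 come from the question. The "0x" prefix is stripped before calling hexdec(), because since PHP 7.4 hexdec() raises a deprecation notice when it has to ignore non-hex characters such as the "x".

```php
<?php
// Flag values from the question; the names are placeholders.
const PERM_EXAMPLE_A = 0x00000008;
const PERM_EXAMPLE_B = 0x20000000;

// Hypothetical helper: convert a string such as "0x20000000" to an integer.
function parse_hex_flag(string $value): int {
    $value = trim($value);
    if (strncasecmp($value, '0x', 2) === 0) {
        $value = substr($value, 2);   // drop the prefix so hexdec() sees only hex digits
    }
    return (int) hexdec($value);
}

// True when every bit of $flag is set in $permissions.
function has_permission(int $permissions, int $flag): bool {
    return ($permissions & $flag) === $flag;
}

$permissions = parse_hex_flag('0x20000000') | PERM_EXAMPLE_A;
var_dump(has_permission($permissions, PERM_EXAMPLE_A)); // bool(true)
var_dump(has_permission(PERM_EXAMPLE_B, PERM_EXAMPLE_A)); // bool(false)
```

If the API already returns a native integer, has_permission() can be applied to it directly and the parsing step is unnecessary.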

Can anyone explain the length parameter to fgets() in PHP?

Assume that I have a file named data.txt with the contents "Blah Blah !".
So when I use the code below
$hnd=fopen('data.txt','r');
echo fgets($hnd,2);
it displays just one character "B" instead of "Bl". Later I read the manual stating:
length
Reading ends when length - 1 bytes have been read, or a newline (which is included in the return value), or an EOF (whichever comes first). If no length is specified, it will keep reading from the stream until it reaches the end of the line.
Can anyone explain to me why it is this way? I mean why is it length-1 and not length.

The C fgets() function reads length - 1 bytes, because it has to add a terminating zero to turn the data into a proper string.
My best guess is that PHP's fgets() exhibits the same behaviour because it is either:
a legacy from the bad old days when PHP functions were little more than wrappers around the corresponding C functions, and string functions were binary unsafe (e.g. strings could not contain embedded NUL characters). Changing the behaviour of the fgets() function would introduce new bugs in existing programs. Or,
a deliberate decision to make the PHP function compatible with the C function to avoid unnecessary surprises.
or both.
Interestingly, it looks like PHP internally adds a terminating zero when storing string values, for example in _php_stream_get_line() (called from fgets()) and zend_string_init().
Since _zend_string objects store the string length anyway, it shouldn't be necessary to store the terminating zero, unless there are still binary unsafe functions in PHP.
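The length - 1 behaviour is easy to observe with an in-memory stream. This is just a sketch: the second line of content is added for illustration, and php://memory stands in for data.txt.

```php
<?php
$hnd = fopen('php://memory', 'w+');
fwrite($hnd, "Blah Blah !\nsecond line\n");
rewind($hnd);

$a = fgets($hnd, 2);  // reads at most 2 - 1 = 1 byte
$b = fgets($hnd, 4);  // reads at most 4 - 1 = 3 bytes
$c = fgets($hnd);     // no length: reads up to and including the newline

var_dump($a); // string(1) "B"
var_dump($b); // string(3) "lah"
var_dump($c); // string(8) " Blah !\n"
fclose($hnd);
```

So to get "Bl" from the original example, the call would need to be fgets($hnd, 3).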

Because PHP, like many C derivatives, counts from 0 and not from 1. They use zero-based numbering.
E.g. for arrays: an array of length/size n has elements 0 to n - 1,
i.e. 0, 1, 2, 3, 4 ... n-1
So an array of length 5 has elements 0, 1, 2, 3, 4
So you will find that whether reading bytes, strings or arrays, they always refer to the (n-1)th element or marker for an n-size structure.

Please use the following code for your question:
$hnd=fopen('E:\\data.txt','r');
echo fgets($hnd,2);
