I have binary data with a mix of uint32s and null-terminated strings. I know the size of an individual data set (each set of data shares the same format), but not the actual format.
I've been using unpack to read the data with the following functions:
function read_uint32( $fh ){
    $return_value = fread( $fh, 4 );
    $return_value = unpack( 'L', $return_value );
    return $return_value[1];
}
function read_string( $fh ){
    $return_string = ''; // initialize to avoid an undefined-variable notice
    do{
        $char = fread( $fh, 1 );
        $return_string .= $char;
    }while( ord( $char ) != 0 );
    return substr( $return_string, 0, -1 );
}
and then basically trying both functions and seeing whether the data makes sense as a string; if not, it's probably an int. Is there an easier way to go about doing this?
Thanks.
Well, I think your approach is okay.
If you only get ASCII strings it's quite easy, as the highest bit will always be 0 (or 1, in some strange cases...). Analyzing some bytes from the file and looking at the distribution will probably tell you whether it's ASCII or something binary.
If you have a different encoding like UTF-8 or something, it's really a pain in the ass.
You could look for recurring CR/LF chars, or filter out the range 0-31 to only let tab, CR, LF, FF slip through. Then analyze the first X bytes and compare the ratio of tab/CR/LF/FF chars to the others. This will work for any encoding, as the ASCII range is standardized...
To determine the actual file type it's probably best to leave this to the OS layer and simply call file from the shell, or use the PHP functions to get the MIME type...
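The distribution idea described above can be sketched as a small helper. This is an illustrative heuristic, not part of the original code; the function name and the 95% threshold are my own choices:

```php
<?php
// Hypothetical helper: guess whether a chunk of bytes looks like text by
// measuring the ratio of printable ASCII (plus tab/LF/FF/CR) to all bytes.
function looks_like_text($bytes, $threshold = 0.95) {
    $len = strlen($bytes);
    if ($len === 0) {
        return false;
    }
    $printable = 0;
    for ($i = 0; $i < $len; $i++) {
        $o = ord($bytes[$i]);
        if (($o >= 32 && $o < 127) || in_array($o, array(9, 10, 12, 13), true)) {
            $printable++;
        }
    }
    return ($printable / $len) >= $threshold;
}
```

You could fread() the first few hundred bytes of a candidate field and run them through this before deciding which read_* function to try.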
Related
I am asking a safe way to read data from $_POST and write to file in PHP, because our PHP code was pointed out to be insecure. The criticism stated that our way may lead to opportunity for attackers to "destroy the data files, or create unexpected new files".
Please consider the following PHP code:
function read_POST($key)
{
$value = 'NULL';
// Check whether the $_POST[$key] is set.
if (array_key_exists($key, $_POST) !== true)
return $value;
// Restrict the length of user input.
$value = substr($_POST[$key], 0, 128);
// Remove possible scripting.
$value = trim(preg_replace("/<\?.*\?>/", '', $value));
// Convert all HTML special characters properly.
$value = filter_var($value, FILTER_SANITIZE_FULL_SPECIAL_CHARS);
return $value;
}
$val1 = read_POST('user_inp1');
$val2 = read_POST('user_inp2');
$val3 = read_POST('user_inp3');
if (($f = fopen('/some/path/of/data/file', 'w')) == True) {
fwrite($f, "Name: $val1\n");
fwrite($f, "Location: $val2\n");
fwrite($f, "Weather: $val3\n");
fclose($f);
}
Here, for simplicity, assuming that all the read values can be arbitrary (but safely sanitized) strings. Could anyone comment on this piece of code ? Are there any security issues ? If yes, how to improve it ?
Thank you very much.
T.H.Hsieh
There is no way for an attacker to destroy or create files arbitrarily, as you're just writing to one specific file and the file name is hard-coded.
Here are some insights:
// Restrict the length of user input.
$value = substr($_POST[$key], 0, 128);
I assume you want to support Unicode text; after all, we are in 2022 at the time of writing. The easiest way is to set everything up (starting from the web pages you're serving, with the forms that send your server the user's data) to encode text as UTF-8. If you're unfamiliar with this topic you'll find plenty of resources online.
That said, it's safer to trim text in a UTF-8-aware way:
$value = mb_substr( $value, 0, 128 );
Note that this way you may end up with a string that takes more than 128 bytes (since UTF-8-encoded characters may take up to 4 bytes).
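To make the character-vs-byte distinction concrete (my own illustration, not from the original post):

```php
<?php
// '€' is 3 bytes in UTF-8, so 128 characters can take far more than 128 bytes.
$s   = str_repeat('€', 200);
$cut = mb_substr($s, 0, 128, 'UTF-8');
echo mb_strlen($cut, 'UTF-8'), "\n"; // 128 characters...
echo strlen($cut), "\n";             // ...but 384 bytes
```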
// Remove possible scripting.
$value = trim(preg_replace("/<\?.*\?>/", '', $value));
// Convert all HTML special characters properly.
$value = filter_var($value, FILTER_SANITIZE_FULL_SPECIAL_CHARS);
I would not do this here. You're storing text.
When you need that text, you'll escape it depending on where you use it. Is it a web page? Then HTML-encode it. Is it a URL fragment? URL-encode it. Is it a database? If the database is configured to receive UTF-8 text, then there's nothing to do.
The only thing you need to do is avoid newlines (\n), due to the way you're saving the data.
If you do the sanitization as per your code, then newlines are already stripped away.
If you follow my advice instead, then you'll need to do it yourself:
$value = str_replace( "\n", " ", $value ); // let's turn them into spaces
$value = str_replace( "\r", " ", $value ); // let's do it for carriage returns too
You want to do this because a "forged" POST request where, for example, user_inp3 is sunny\nName: foo\nLocation: nowhere\nWeather: rainy would result in unwanted extra records being created in the output file.
if (($f = fopen('/some/path/of/data/file', 'w')) == True) {
fwrite($f, "Name: $val1\n");
fwrite($f, "Location: $val2\n");
fwrite($f, "Weather: $val3\n");
fclose($f);
}
Here you have two issues:
First, as mentioned in the comments, using the w option with fopen will result in the file being rewritten on each script execution.
Assuming you need to keep the data, use a instead, and the new data will be appended to the file.
Second, you may have a concurrency issue: if two requests occur at the same time, you may end up with two instances of the PHP interpreter writing to the output file simultaneously, messing up the data.
This is why databases are normally used to store data.
If you need to stick with the filesystem approach, then lock the file before writing, check that the lock succeeded, and unlock it at the end. See flock().
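A minimal sketch of the append-with-lock pattern (the function name is mine; error handling is kept short for brevity):

```php
<?php
// Append one record under an exclusive lock so that concurrent requests
// cannot interleave their writes.
function append_record($path, $val1, $val2, $val3) {
    $f = fopen($path, 'a');
    if ($f === false) {
        return false;
    }
    $ok = false;
    if (flock($f, LOCK_EX)) {          // block until we own the file
        fwrite($f, "Name: $val1\n");
        fwrite($f, "Location: $val2\n");
        fwrite($f, "Weather: $val3\n");
        fflush($f);                    // push buffered data before unlocking
        flock($f, LOCK_UN);
        $ok = true;
    }
    fclose($f);
    return $ok;
}
```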
I'm trying to encode a chunk of binary data with PHP in the same way zlib's compress2() function does it. However, using zlib_encode(), I get the wrong encoded output. I know this because I have a C program that does it (correctly). When I compare the output (using a hex editor) of the C program against that of the PHP script below, I notice it doesn't match at all.
My question I guess is, does this really compress in the same way zlib's compress2() function does?
<?php
$filename = 'C:\data.bin';
$in = fopen($filename, 'rb');
$data = fread($in, filesize($filename));
fclose($in);
$data_dec = zlib_decode($data);
$data_enc = zlib_encode($data_dec, ZLIB_ENCODING_DEFLATE, 9);
?>
The compression level is correct, so it should match the C program's encoded output. Is there a bug somewhere, perhaps?
Yes, zlib_encode() (with the default arguments) and uncompress() are compatible, and compress2() and zlib_decode() are compatible.
The way to check is not to compare compressed output. Check by decompressing with uncompress() and zlib_decode(). There is no reason to expect that the compressed output will be the same, and it does not need to be. All that matters is that it can be losslessly decompressed on the other end.
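A quick way to convince yourself, assuming the zlib extension is loaded:

```php
<?php
// Compare decompressed data, not compressed bytes: a lossless round trip
// is the only guarantee the format gives you.
$original   = str_repeat("payload \x00\x01\x02 ", 100);
$compressed = zlib_encode($original, ZLIB_ENCODING_DEFLATE, 9);
$restored   = zlib_decode($compressed);
var_dump($restored === $original); // bool(true)
```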
In PHP, is there a way to write binary data to the response stream,
like the equivalent of this (C# ASP):
System.IO.BinaryWriter Binary = new System.IO.BinaryWriter(Response.OutputStream);
Binary.Write((System.Int32)1);//01000000
Binary.Write((System.Int32)1020);//FC030000
Binary.Close();
I would then like to be able read the response in a c# application, like
System.Net.HttpWebRequest Request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("URI");
System.IO.BinaryReader Binary = new System.IO.BinaryReader(Request.GetResponse().GetResponseStream());
System.Int32 i = Binary.ReadInt32();//1
i = Binary.ReadInt32();//1020
Binary.Close();
In PHP, strings and byte arrays are one and the same. Use pack to create a byte array (string) that you can then write. Once I realized that, life got easier.
$my_byte_array = pack("VV", 1, 1020); // "V" = unsigned 32-bit little-endian: 01 00 00 00 FC 03 00 00
$fp = fopen("somefile.txt", "w");
fwrite($fp, $my_byte_array);
// or just echo to stdout
echo $my_byte_array;
Usually, I use chr();
echo chr(255); // Returns one byte, value 0xFF
http://php.net/manual/en/function.chr.php
This is the same answer I posted to this, similar, question.
Assuming that the array $binary is a previously constructed array of bytes (like monochrome bitmap pixels in my case) that you want written to disk in this exact order, the code below worked for me on an AMD 1055t running Ubuntu Server 10.04 LTS.
I iterated over every kind of answer I could find on the net, checking the output (I used either shed or vi, as in this answer) to confirm the results.
<?php
$fp = fopen($base.".bin", "w");
for( $idx = 0; $idx < $stop; $idx = $idx + 2 ){
    if( array_key_exists($idx, $binary) )
        fwrite( $fp, pack( "n", $binary[$idx]<<8 | $binary[$idx+1] ) );
    else {
        echo "index $idx not found in array \$binary[], wtf?\n";
    }
}
fclose($fp);
echo "Filename $base.bin had ".filesize($base.".bin")." bytes written\n";
?>
You probably want the pack function -- it gives you a decent amount of control over how you want your values structured as well, i.e., 16 bits or 32 bits at a time, little-endian versus big-endian, etc.
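For instance (my own illustration of the format codes):

```php
<?php
// pack() format codes control width and byte order explicitly:
//   V = unsigned 32-bit little-endian, N = unsigned 32-bit big-endian,
//   v/n are the 16-bit equivalents, L uses machine byte order.
echo bin2hex(pack('V', 1020)), "\n"; // fc030000
echo bin2hex(pack('N', 1020)), "\n"; // 000003fc
echo bin2hex(pack('v', 1020)), "\n"; // fc03
```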
Similar questions:
Some characters in CSV file are not read during PHP fgetcsv() ,
fgetcsv() ignores special characters when they are at the beginning of line
My application has a form where the users can upload a CSV file (its 5 internal users have always uploaded a valid file - comma-delimited, quoted, records end by LF), and the file is then imported into a database using PHP:
$fhandle = fopen($uploaded_file,'r');
while($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
print_r($row);
// further code not relevant as the data is already corrupt at this point
}
For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.
The problem: some (not all!) characters beyond 127 ("extended ASCII") are dropped by fgetcsv(). Example data:
"15","Ústav"
"420","Špičák"
"7","Tmaň"
becomes
Array (
0 => 15
1 => "stav"
)
Array (
0 => 420
1 => "pičák"
)
Array (
0 => 7
1 => "Tma"
)
(Note that č is kept, but Ú is dropped)
The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken and should I look for a different way to parse CSV?
It turns out that I didn't read the documentation well enough - fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII < 127, but the documentation also says:
Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.
In other words, fgetcsv() tries to be binary-safe, but it actually isn't (because it's also messing with the charset at the same time), and it will probably mangle the data it reads (as this setting is not configured in php.ini, but rather read from $LANG).
I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from the comment in the docs to parse them into an array:
$fhandle = fopen($uploaded_file,'r');
while($raw_row = fgets($fhandle)) { // fgets is actually binary safe
$row = csvstring_to_array($raw_row, ',', '"', "\n");
// $row is now read correctly
}
Say we have a UTF-8 string $s and we need to shorten it so it can be stored in N bytes. Blindly truncating it to N bytes could mess it up. But decoding it to find the character boundaries is a drag. Is there a tidy way?
[Edit 20100414] In addition to S.Mark’s answer: mb_strcut(), I recently found another function to do the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.
Edit: S.Mark's answer is actually better than mine - PHP has a (badly documented) builtin function that solves exactly this problem.
Original "back to the bits" answer follows:
Truncate to the desired byte count
If the last byte starts with 110 (binary), drop it as well
If the second-to-last byte starts with 1110 (binary), drop the last 2 bytes
If the third-to-last byte starts with 11110 (binary), drop the last 3 bytes
This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
Unfortunately (as Andrew reminds me in the comments) there are also cases where two separately encoded Unicode code points form a single character (basically, diacritics such as accents can be represented as a separate code point modifying the preceding letter).
Handling this kind of thing requires advanced Unicode-fu which is not available in PHP and may not even be possible for all cases (there are some weird scripts out there!), but fortunately it's relatively rare, at least for Latin-based languages.
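The four steps above can be sketched like this (my own implementation of the answer's algorithm; it assumes the input is valid UTF-8):

```php
<?php
// Truncate valid UTF-8 to at most $n bytes without leaving a dangling
// incomplete character at the end.
function utf8_truncate($s, $n) {
    $s   = substr($s, 0, $n);
    $len = strlen($s);
    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    for ($i = $len - 1; $i >= 0 && $i >= $len - 3; $i--) {
        $byte = ord($s[$i]);
        if (($byte & 0xC0) === 0x80) {
            continue; // continuation byte, keep walking back
        }
        // $byte is ASCII or a lead byte: how many bytes does its sequence need?
        $need = $byte >= 0xF0 ? 4 : ($byte >= 0xE0 ? 3 : ($byte >= 0xC0 ? 2 : 1));
        if ($need > $len - $i) {
            $s = substr($s, 0, $i); // sequence is incomplete, drop it
        }
        break;
    }
    return $s;
}
```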
I think you don't need to reinvent the wheel; you could just use mb_strcut and make sure you set the encoding to UTF-8 first.
mb_internal_encoding('UTF-8');
echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from index 0, cut at most 3 bytes
It returns
\xc2\x80
because in \xc2\x80\xc2 the last byte would start an incomplete character.
I coded up this simple function for this purpose, you need mb_string though.
function str_truncate($string, $bytes = null)
{
if (isset($bytes) === true)
{
// to speed things up
$string = mb_substr($string, 0, $bytes, 'UTF-8');
while (strlen($string) > $bytes)
{
$string = mb_substr($string, 0, -1, 'UTF-8');
}
}
return $string;
}
While this code also works, S.Mark's answer is obviously the way to go.
Here's a test for mb_strcut(). It doesn't prove that it does just what we're looking for but I find it pretty convincing.
<?php
ini_set('default_charset', 'UTF-8' );
$strs = array(
'Iñtërnâtiônàlizætiøn',
'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
'ايران لا ترى تغييرا في الموقف الأمريكي',
'独・米で死傷者を出した銃の乱射事件',
'國會預算處公布驚人的赤字數據後',
'이며 세계 경제 회복에 걸림돌이 되고 있다',
'В дагестанском лесном массиве южнее села Какашура',
'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
'რუსეთი ასევე გეგმავს სამხედრო');
for ( $i = 10; $i <= 30; $i += 5 ) {
foreach ($strs as $s) {
$t = mb_strcut($s, 0, $i, 'UTF-8');
print(
sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
. ( mb_check_encoding($t, 'UTF-8') ? ' OK ' : ' Bad ' )
. $t . "\n");
}
}
?>
In addition to S.Mark’s answer which was mb_strcut(), I recently found another function to do a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.
The functionality is a bit different: the mb_strcut() documentation says it cuts at the nearest UTF-8 character boundary, so it doesn't respect multi-character graphemes, while grapheme_extract(), on the other hand, does. So depending on what you need, grapheme_extract() might be better (e.g. to display a string) or mb_strcut() might be better (e.g. for indexing). Anyway, just thought I'd mention it.
(And since intl is an ICU wrapper, I have a lot of confidence in it.)
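A small demonstration of the difference, using "é" written as two code points (base letter plus combining accent); the grapheme_extract() part assumes the intl extension is available:

```php
<?php
// "é" as e (1 byte) + combining acute accent U+0301 (2 bytes in UTF-8).
$s = "e\xCC\x81f";
// mb_strcut() stops at a code-point boundary: asking for 2 bytes keeps only
// the "e", splitting the accent off the letter it belongs to.
var_dump(mb_strcut($s, 0, 2, 'UTF-8')); // string(1) "e"
// grapheme_extract() works in whole graphemes instead (needs ext-intl):
if (function_exists('grapheme_extract')) {
    var_dump(grapheme_extract($s, 2, GRAPHEME_EXTR_MAXBYTES));
}
```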
No. There is no way to do this other than decoding. The coding is pretty mechanical, however; see the pretty table in the Wikipedia article.
Edit: Michael Borgwardt shows us how to do it without decoding the whole string. Clever.