Get Zipfile as Bytes in php? - php

Using php, how can I read a zip file and get its bytes, for example something like
$contents = file_get_contents('myzipfile.zip');
echo $contents;
// outputs: 504b 0304 1400 0000 0800 1bae 2f46 20e0
Thank you!

file_get_contents gets the raw bytes, your echo outputs those raw bytes. If you expect to output a hexadecimal notation of the raw byte contents instead, use bin2hex:
echo bin2hex($contents);
If you want that arbitrarily grouped with a space every two bytes, you can do something along these lines:
echo join(' ', str_split(bin2hex($contents), 4));
(Note that this is all rather inefficient, modifying the entire, possibly many megabyte large file in memory. I'm expecting this is just for debugging purposes, so won't go out of my way to write super efficient code.)

file_get_contents() will return the exact contents of the file, so the format depends on the file type.
If you are looking for the byte size of the file you can get any available file information with the core SPL library's fileInfo class:
$info = new SplFileInfo('myzipfile.zip');
$bytes = $info->getSize();

Related

Piecemeal bzcompression for large files in PHP

Creating bzip2 archived data in PHP is very easy thanks to its implementation in bzcompress. In my present application I cannot in all reason simply read the input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data will yield the same result as when compressing the whole file in one single swoop. I mean something along the lines of
$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);
I tried out a piecemeal bzcompression using the routines shown below
function makeBZFile($infile,$outfile)
{
$fp = fopen($infile,'r');
$bz = bzopen($outfile,'w');
while (!feof($fp))
{
$bytes = fread($fp,10240);
bzwrite($bz,$bytes);
}
bzclose($bz);
fclose($fp);
}
function unmakeBZFile($infile,$outfile)
{
$bz = bzopen($infile,'r');
while (!feof($bz))
{
$str = bzread($bz,10240);
file_put_contents($outfile,$str,FILE_APPEND);
}
}
set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd');
To test this code I did two things
I used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database - which is what I need to do eventually.
I created a 50Mb filled with random data dd if=/dev/urandom of='/tmp.test.rnd bs=50M count=1
In both cases I performed a diff original.file decompressed.file and found that the two were identical.
All very nice but it is not clear to me why this is working. The PHP docs state that bzread(bzpointer,length) reads a maximum length bytes of UNCOMPRESSED data. If my code below is woring it is because I am forcing the bzwite and bzread size to 10240 bytes.
What I cannot see is just how bzread knows how to fetch lenth bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see tht there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.
I suspect there is a gap in my understanding of how this works - or else the fact that my code below appears to perform a correct piecemeal compression is purely accidental.
I'd much appreciate a few explanations here.
To understand how the decompression get the length of bytes you have to understand first the compression. It seems that you don't know any thing about compression algorigthim.
BZIP2
Crucial algorithm of BZIP2 is the Burrows Wheeler transformation (BWT), that converts the original data into a suitable form for following coding. The current version applies a Huffman code. Compression algorithm processes the data in blocks totally independent from each block. Block sizes can be set in a range from 1-9 (100,000 - 900,000 bytes).
BZIP2 Data Structure
The first two character of compressed string start with letter 'BZ' and thereafter 1 byte for algorigthim used. Thereafter identification of the block size immediately follows, being valid for the entire file (h1, h2, h3 to h9). The parameter indicates the block size in units from 1-9 (100,000 - 900,000 bytes).
Actual original data are stored in blocks according to the selected size and will be protected individually with a CRC32 checksum. Additionally a 48 bit identifier introduces each block. This block structure allows a partial reconstruction of damaged files.
GZIP/BZIP
Gzip and bzip2 are functionally equivalent. One advantage of GZIP is that it can compress a stream, a sequence where you can't look behind. This makes it the official compressor of http streams. GZZIP DEFLATE RFC 1951 Compressed Data Format Specification and GUNZIP RFC 1952 File Format Specification are published documents.
GIP explained

Get compressed byte size after zlib_decode()?

I'm trying to use PHP to parse a custom gzip archive file format that was created in Delphi (not my code!). The format is basically:
4-byte integer: count of files in archive
for each compressed file:
4-byte integer: filename length [n]
[n] bytes: filename
4-byte integer: uncompressed file length [m]
[????] bytes: gzipped content
I can read the file and actually decode the first compressed file correctly by using zlib_decode() with a max uncompressed length of [m] bytes on the remainder of the file after I know the length ([m]), but then I'm stuck because I don't know how far into the substring I should go to find the next filename -- zlib_decode() doesn't return the number of compressed bytes that it processed before stopping. Since this is a custom format, it doesn't seem like I can use the normal gzopen()/gzread() functions because the entire file isn't compressed (I tried, it doesn't work).
This code works in Delphi because apparently you can pass a file handle back and forth between normal file reading functions and the System.ZLib decoding functions -- you can read [m] uncompressed bytes and the pointer will remain at the last compressed byte -- but PHP doesn't seem to support switching between read-as-normal and read-as-gzip on the fly that way.
Am I missing an obvious way in PHP to deal with a mixed-content file format like this, where metadata and compressed data are stacked together this way? Or am I out of luck without knowing the compressed data length?
A dirty workaround is to recompress the content of each file as I am able to parse it, use that to calculate the compressed length, and adjust the file pointer in the original file manually as follows:
$current_pos = ftell($handle);
$skip_length = strlen(gzencode($uncompressed_text,9,FORCE_DEFLATE));
fseek($handle, $skip_length+$current_pos);
This works, but feels very hack-ish. I'd still be open to any better approaches.
EDIT:
Just a note that this eventually failed. However, I was fortunate enough to know in advance the list of expected filenames and I was able to do the following (more reliable since zlib_decode() will decode as much as it can and discard the rest anyway):
foreach ($filenames as $thisFilename) {
$thisPos = strpos($rawData, $thisFilename);
$gzresult = zlib_decode(substr($rawData, $thisPos + strlen($table) + 8)); // skip 8 bytes for filename size and uncompressed data size, which are useless info.
}

Unexpected output with zlib_encode()

I'm trying to encode a chunk of binary data with PHP in the same way zlib's compress2() function does it. However, using zlib_encode(), I get the wrong encoded output. I know this because I have a C program that does it (correctly). When I compare the output (using a hex editor) of the C program against that of the PHP script below, I notice it doesn't match at all.
My question I guess is, does this really compress in the same way zlib's compress2() function does?
<?php
$filename = 'C:\data.bin';
$in = fopen($filename, 'rb');
$data = fread($in, filesize($filename));
fclose($in);
$data_dec = zlib_decode($data);
$data_enc = zlib_encode($data_dec, ZLIB_ENCODING_DEFLATE, 9);
?>
The compression level is correct, so it should match with the C program's encoded output. Is there a bug somewhere perhaps.. ?
Yes, zlib_encode() (with the default arguments), and uncompress() are compatible, and compress2() and zlib_decode() are compatible.
The way to check is not to compare compressed output. Check by decompressing with uncompress() and zlib_decode(). There is no reason to expect that the compressed output will be the same, and it does not need to be. All that matters is that it can be losslessly decompressed on the other end.

PHP: Amount of bytes fread

Say I read a number of bytes like this:
$data = fread($fp, 4096);
Since fread will stop reading if it reaches the end of the file, how can I know exactly how much was read? Would strlen($data) work? Or could that be potentially wrong?
What I'm trying to accomplish, is to read a number of bytes, and then go back to where I was before I read. And I'm trying to avoid using arithmetic (ftell, fread, ftell, subract, fseek), since a file could potentially be larger than PHP_INT_MAX and potentially mess that up. What I would want is to just do fseek($fp, -$bytes_read, SEEK_CUR), but for that I need to know how many bytes I just read...
After fread use ftell($fp) to get the current file position.
Check this (untested):
mb_strlen($data, '8bit')
The second argument '8bit' forces the function to return the number of bytes.
Found in comments at php manual for mb_strlen.

PHP fseek() equivalent for variables?

What I need is an equivalent for PHP's fseek() function. The function works on files, but I have a variable that contains binary data and I want to work on it. I know I could use substr(), but that would be lame - it's used for strings, not for binary data. Also, creating a file and then using fseek() is not what I am looking for either.
Maybe something constructed with streams?
EDIT: Okay, I'm almost there:
$data = fopen('data://application/binary;binary,'.$bin,'rb');
Warning: failed to open stream: rfc2397: illegal parameter
Kai:
You have almost answered yourself here. Streams are the answer. The following manual entry will be enlightening: http://us.php.net/manual/en/wrappers.data.php
It essentially allows you to pass arbitrary data to PHP's file handling functions such as fopen (and thus fseek).
Then you could do something like:
<?php
$data = fopen('data://mime/type;encoding,' . $binaryData);
fseek($data, 128);
?>
fseek on data in a variable doesn't make sense. fseek just positions the file handle to the specified offset, so the next fread call starts reading from that offset. There is no equivalent of fread for strings.
Whats wrong with substr()?
With a file you would do:
$f = fopen(...)
fseek($f, offset)
$x = fread($f, len)
with substr:
$x = substr($var, offset, len)
I'm guessing, but maybe what is being asked for is a way to access bytes in a variable by using a pointer.. (using it like an array of bytes like you could do in c - without the memory overhead of putting the data in php arrays) and being able to edit them inplace without the overhead of copying the data.
Not being able to do this is a BIG problem, but if the operating system caches disk data well using fseek on a temporary file could be a workaround.

Categories