Compress gzip string - php

I want to compress a string in PHP and write it to a file without using the gzwrite function, because I want to store the actual compressed string in a database first. I am unsure whether to use gzcompress, gzencode or gzdeflate, as the difference is not very clear.
Any ideas?
Edit: the already compressed string will be written into a *.gz file from the database so it has to be compatible.

Use gzcompress if you just want to compress the string; it produces zlib-format data (a deflate stream with a small zlib header and a checksum).
gzencode also adds gzip file headers, so the result can be uncompressed directly by gzip and similar tools.
gzdeflate produces a raw deflate stream, i.e. the same kind of compressed data as the other two but without any header or checksum.
I think you want to use gzencode in this case, since the data is going to be stored as a .gz file.
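To make the difference between the three outputs visible, here is a minimal sketch. It is written in Python rather than PHP (Python's zlib and gzip modules wrap the same zlib library), and the helper name three_formats is made up for illustration; the PHP counterparts are noted in the comments.

```python
import gzip
import zlib


def three_formats(data: bytes) -> dict:
    """Compress the same bytes into the three containers discussed above."""
    raw = zlib.compressobj(wbits=-15)      # negative wbits = raw deflate, like gzdeflate()
    return {
        "zlib": zlib.compress(data),       # zlib wrapper, like gzcompress()
        "gzip": gzip.compress(data),       # gzip wrapper, like gzencode()
        "deflate": raw.compress(data) + raw.flush(),
    }


out = three_formats(b"hello hello hello hello")
# Print the first two bytes of each variant to compare the wrappers.
for name, blob in out.items():
    print(name, blob[:2].hex(), len(blob))
```

Only the gzip variant begins with the magic bytes 1f 8b that gzip tools look for, so it is the one that can sit in a database column and later be written to a .gz file unchanged.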


How to convert base64 encoded attachment to file (PDF, msg, eml)

We have an XML file exported from ServiceNow which we are trying to import into our custom PHP app.
Each attachment (<sys_attachment>) is split into chunks (<sys_attachment_doc>), which are ordered by the <position> element.
<sys_attachment>
<chunk_size_bytes>734003</chunk_size_bytes>
<compressed>true</compressed>
<content_type>application/pdf</content_type>
<encryption_context display_value="" />
<file_name>Filename.pdf</file_name>
</sys_attachment>
<sys_attachment_doc>
<data>[BASE64 ENCODED STRING HERE]</data>
<length>[STRING LENGTH]</length>
<position>0</position>
</sys_attachment_doc>
<sys_attachment_doc>
<data>[BASE64 ENCODED STRING HERE]</data>
<length>[STRING LENGTH]</length>
<position>1</position>
</sys_attachment_doc>
We have tried combining the string and base64_decoding it but to no avail.
<?php
header('Content-type: application/pdf');
header('Content-Disposition: attachment; filename="servicenow.pdf"');
//echo base64_decode($chunk0.$chunk1);
echo base64_decode($chunk0).base64_decode($chunk1);
?>
We are unable to find any documentation on how to convert these attachments to files outside of ServiceNow (in PHP). Is there an extra step that needs to be done before decoding the string and converting it to a file (PDF)?
Edit: I managed to solve it using Joey's answer. I base64_decode each chunk and then combine the results. The combined string is actually gzip compressed, so we used gzdecode() to generate the PDF.
$attachment = base64_decode($chunk0).base64_decode($chunk1);
echo gzdecode($attachment);
One thing that may be tripping you up is that <compressed> flag. Since that's reading as true, the data is also gzipped, so attachments start from byte[], which then get gzipped, broken into chunks, and base64 encoded (per chunk!).
I don't know how to do this in php specifically, but this strategy should work:
Base64 decode of each chunk will give you a byte[] per chunk.
Combine those chunks in order of position to give you one big byte stream
gunzip that stream into another big byte stream which should be your file.
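The three steps above can be sketched as follows. This is a Python illustration under assumptions, not ServiceNow's documented procedure: the function name rebuild_attachment is hypothetical, and the chunks are assumed to have already been pulled out of the XML as (position, base64 text) pairs with integer positions.

```python
import base64
import gzip


def rebuild_attachment(chunks):
    """chunks: iterable of (position, base64_text) pairs, in any order."""
    # Step 2 needs the chunks in <position> order.
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    # Step 1: decode every chunk separately, then join the resulting bytes.
    compressed = b"".join(base64.b64decode(text) for _, text in ordered)
    # Step 3: the joined bytes form one gzip stream.
    return gzip.decompress(compressed)
```

Decoding per chunk matters because each chunk may carry its own base64 padding; concatenating the base64 text first and decoding once can fail or silently truncate.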

Piecemeal bzcompression for large files in PHP

Creating bzip2 archived data in PHP is very easy thanks to its implementation in bzcompress. In my present application I cannot in all reason simply read the input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data will yield the same result as when compressing the whole file in one single swoop. I mean something along the lines of
$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);
I tried out a piecemeal bzcompression using the routines shown below
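function makeBZFile($infile,$outfile)
{
    $fp = fopen($infile,'rb'); // binary-safe read of the source file
    $bz = bzopen($outfile,'w');
    while (!feof($fp))
    {
        // Feed the compressor 10 KB at a time.
        $bytes = fread($fp,10240);
        bzwrite($bz,$bytes);
    }
    bzclose($bz); // closing writes out whatever is still pending
    fclose($fp);
}
function unmakeBZFile($infile,$outfile)
{
    $bz = bzopen($infile,'r');
    while (!feof($bz))
    {
        // bzread returns at most 10240 bytes of uncompressed data.
        $str = bzread($bz,10240);
        file_put_contents($outfile,$str,FILE_APPEND);
    }
    bzclose($bz);
}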
set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd');
To test this code I did two things
I used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database - which is what I need to do eventually.
I created a 50 MB file filled with random data: dd if=/dev/urandom of=/tmp/test.rnd bs=50M count=1
In both cases I performed a diff original.file decompressed.file and found that the two were identical.
All very nice, but it is not clear to me why this is working. The PHP docs state that bzread(bzpointer,length) reads a maximum of length bytes of UNCOMPRESSED data. If my code above is working, it is because I am forcing the bzwrite and bzread size to 10240 bytes.
What I cannot see is just how bzread knows how to fetch length bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see that there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.
I suspect there is a gap in my understanding of how this works, or else the fact that my code above appears to perform a correct piecemeal compression is purely accidental.
I'd much appreciate a few explanations here.
To understand how decompression gets the length in bytes, you first have to understand the compression. It seems that you are not familiar with how the compression algorithm works.
BZIP2
The crucial algorithm of BZIP2 is the Burrows-Wheeler transform (BWT), which converts the original data into a form suitable for the coding that follows. The current version applies a Huffman code. The compression algorithm processes the data in blocks, each totally independent of the others. Block sizes can be set in a range from 1-9 (100,000 - 900,000 bytes).
BZIP2 Data Structure
The compressed data starts with the two characters 'BZ', followed by 1 byte for the algorithm used ('h' for Huffman). The identification of the block size follows immediately and is valid for the entire file (h1, h2, h3 up to h9). The parameter indicates the block size in units from 1-9 (100,000 - 900,000 bytes).
The actual original data are stored in blocks according to the selected size, and each block is protected individually with a CRC32 checksum. Additionally, a 48 bit identifier introduces each block. This block structure allows a partial reconstruction of damaged files.
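To connect this back to the question: the write size and the read size do not have to match, because both sides buffer. The compressor collects whatever it is fed until it has a full block, and the decompressor decodes blocks into an internal buffer and hands out as many uncompressed bytes as the caller asks for. A minimal sketch in Python's bz2 module (which wraps the same libbz2 library) illustrates this; the function names and chunk sizes are made up for the example.

```python
import bz2
import io


def compress_piecemeal(data: bytes, chunk_size: int = 10240) -> bytes:
    """Feed the compressor small pieces, as the bzwrite loop does."""
    comp = bz2.BZ2Compressor()
    out = []
    for start in range(0, len(data), chunk_size):
        # compress() may return b"" while the compressor is still buffering.
        out.append(comp.compress(data[start:start + chunk_size]))
    out.append(comp.flush())  # emit the final block and the stream trailer
    return b"".join(out)


def read_piecemeal(compressed: bytes, chunk_size: int = 4096):
    """Yield uncompressed pieces of at most chunk_size bytes, as bzread does."""
    with bz2.BZ2File(io.BytesIO(compressed)) as stream:
        while True:
            piece = stream.read(chunk_size)
            if not piece:
                break
            yield piece
```

Here the data is written in 10240-byte pieces and read back in 4096-byte pieces, and joining the pieces still reproduces the input, so the matching sizes in the question are not what makes it work.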
GZIP/BZIP
Gzip and bzip2 are functionally equivalent. One advantage of GZIP is that it can compress a stream, a sequence where you can't look behind. This makes it the official compressor of HTTP streams. The DEFLATE Compressed Data Format Specification (RFC 1951) and the GZIP File Format Specification (RFC 1952) are published documents.
GZIP explained

Get compressed byte size after zlib_decode()?

I'm trying to use PHP to parse a custom gzip archive file format that was created in Delphi (not my code!). The format is basically:
4-byte integer: count of files in archive
for each compressed file:
4-byte integer: filename length [n]
[n] bytes: filename
4-byte integer: uncompressed file length [m]
[????] bytes: gzipped content
I can read the file and actually decode the first compressed file correctly by using zlib_decode() with a max uncompressed length of [m] bytes on the remainder of the file after I know the length ([m]), but then I'm stuck because I don't know how far into the substring I should go to find the next filename -- zlib_decode() doesn't return the number of compressed bytes that it processed before stopping. Since this is a custom format, it doesn't seem like I can use the normal gzopen()/gzread() functions because the entire file isn't compressed (I tried, it doesn't work).
This code works in Delphi because apparently you can pass a file handle back and forth between normal file reading functions and the System.ZLib decoding functions -- you can read [m] uncompressed bytes and the pointer will remain at the last compressed byte -- but PHP doesn't seem to support switching between read-as-normal and read-as-gzip on the fly that way.
Am I missing an obvious way in PHP to deal with a mixed-content file format like this, where metadata and compressed data are stacked together this way? Or am I out of luck without knowing the compressed data length?
A dirty workaround is to recompress the content of each file as I am able to parse it, use that to calculate the compressed length, and adjust the file pointer in the original file manually as follows:
$current_pos = ftell($handle);
$skip_length = strlen(gzencode($uncompressed_text,9,FORCE_DEFLATE));
fseek($handle, $skip_length+$current_pos);
This works, but feels very hack-ish. I'd still be open to any better approaches.
EDIT:
Just a note that this eventually failed. However, I was fortunate enough to know in advance the list of expected filenames and I was able to do the following (more reliable since zlib_decode() will decode as much as it can and discard the rest anyway):
foreach ($filenames as $thisFilename) {
    $thisPos = strpos($rawData, $thisFilename);
    // skip the filename plus 8 bytes for filename size and uncompressed data size, which are useless info.
    $gzresult = zlib_decode(substr($rawData, $thisPos + strlen($thisFilename) + 8));
}
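An alternative to recompressing or searching for filenames is to use a streaming decompressor that reports where the compressed stream ended. The sketch below is in Python, not PHP, and rests on assumptions that the question does not confirm: the 4-byte integers are little-endian, each entry's content is a zlib or gzip stream, and filenames are single-byte encoded. The name parse_archive is hypothetical. Python's decompression object keeps any bytes after the end of the stream in unused_data, which gives the compressed length by subtraction.

```python
import struct
import zlib


def parse_archive(raw: bytes) -> dict:
    """Return {filename: content} for the mixed metadata/compressed layout."""
    (count,) = struct.unpack_from("<I", raw, 0)   # assumed little-endian
    pos = 4
    files = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        name = raw[pos:pos + name_len].decode("latin-1")
        pos += name_len
        (size,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        decoder = zlib.decompressobj(wbits=47)    # 32 + 15: auto-detect zlib or gzip
        content = decoder.decompress(raw[pos:])
        if not decoder.eof or len(content) != size:
            raise ValueError(f"bad or truncated entry: {name}")
        # Everything the decoder did not consume belongs to the next entry.
        pos = len(raw) - len(decoder.unused_data)
        files[name] = content
    return files
```

The stored uncompressed length is then only needed as a sanity check, since the end of each compressed stream is detected by the decoder itself.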

Encode string in Android so PHP would be able to gzdecompress it?

How do I correctly compress a string, so PHP would be able to decompress?
I tried this:
public static byte[] compress(String string) throws IOException {
    ByteArrayOutputStream os = new ByteArrayOutputStream(string.length());
    DeflaterOutputStream gos = new DeflaterOutputStream(os);
    // ALSO TRIED GZIPOutputStream, same results!
    gos.write(string.getBytes());
    gos.close();
    byte[] compressed = os.toByteArray();
    os.close();
    return compressed;
}
But PHP does not recognize output as valid GZip compressed string...
The problem seems to be in some headers / footers being added by Android...
For example, when I compress the word "something" in PHP with gzcompress, I get results similar to Android's, but not similar enough for PHP to read them:
something (HEX DUMP):
Android: 1f8b08000000000000002bcecf4d2dc9c8cc4b0700fb31da0909000000
PHP: 789c2bcecf4d2dc9c8cc4b0700134703cf
The weirdest thing is that changing GZIPOutputStream to DeflaterOutputStream fixed the problem for the word "something", but the problem still appears with longer strings...
PS. Removing the leading 10 bytes from the Android-generated data does not help at all.
EDIT: I tried to decompress it in PHP with:
gzdecode() - this function does not exist in the standard Debian PHP5 version.
gzuncompress() - does not work
And some functions to emulate gzdecode() from the PHP site comments, which don't really do much.
All of the above were tried both with the first 10 bytes removed and with them left in place.
PS2. I tried every single solution from Stack Overflow, and other sources, and still nothing. It is not a duplicate.
EDIT2 (BINARY DUMP): Sample data generated with Android that can't be decompressed by gzuncompress() or the pseudo-gzdecode() functions from PHP.NET: data.compressed.
It is supposed to be some JSON after decompression.
The Android data that starts with 1f8b is a gzip stream. In php you use gzdecode() for that. gzencode() on php makes gzip streams.
The php data that starts with 789c is a zlib stream. You used gzcompress() to make that, and you would use gzuncompress() to decode it.
The compressed data contained within both of those streams, starting with 2bce is raw deflate data. You can use gzinflate() to decode that if you happened to make it somewhere, and you can use gzdeflate() to generate raw deflate.
Just to rant, gzencode(), gzcompress(), and gzdeflate() are some of the most misleading function names ever concocted, since only one of them is related to gzip yet all start with gz, and nothing in the name gzcompress() indicates zlib.
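Since the three formats differ only in their wrapper, the first two bytes are usually enough to pick the right decoder. Here is a minimal sketch in Python rather than PHP (same zlib library underneath); decode_any is a made-up helper name, and the PHP function that handles each case is named in the comments. The sample input is the PHP hex dump from the question.

```python
import zlib


def decode_any(data: bytes) -> bytes:
    """Pick the decoder from the first two bytes, falling back to raw deflate."""
    if data[:2] == b"\x1f\x8b":
        # gzip magic number: gzip wrapper, handled by gzdecode() in PHP
        return zlib.decompress(data, wbits=31)
    looks_like_zlib = (
        len(data) >= 2
        and data[0] & 0x0F == 8                        # compression method 8 = deflate
        and int.from_bytes(data[:2], "big") % 31 == 0  # zlib header check value
    )
    if looks_like_zlib:
        # zlib wrapper (typically starts with 78 9c), handled by gzuncompress()
        return zlib.decompress(data)
    # No recognizable wrapper: treat as raw deflate, handled by gzinflate()
    return zlib.decompress(data, wbits=-15)


php_dump = bytes.fromhex("789c2bcecf4d2dc9c8cc4b0700134703cf")
print(decode_any(php_dump))
```

The fallback to raw deflate is a guess by elimination, since raw deflate has no signature of its own; data in some other format will simply raise zlib.error there.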
Update:
The "EDIT2" data is, for some reason, doubly compressed. It was compressed first to the zlib format, and then that zlib stream was compressed to the gzip format. (Though gzip couldn't compress the already compressed data, so it's a little bigger.)
You should repair the problem that made it doubly compressed. Or if you have no control over that, you can doubly decompress it, first stripping the gzip header using the RFC 1952 specification and then using gzinflate() on the raw deflate data, and then using gzuncompress() on the result.
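For illustration, the double decompression amounts to peeling the two wrappers in reverse order. A minimal Python sketch (not PHP; undo_double is a hypothetical name), where the gzip module takes care of the RFC 1952 header and trailer:

```python
import gzip
import zlib


def undo_double(blob: bytes) -> bytes:
    """Undo gzip(zlib(data)): remove the outer gzip layer, then the inner zlib layer."""
    inner_zlib_stream = gzip.decompress(blob)
    return zlib.decompress(inner_zlib_stream)
```

Fixing the sender so that it compresses only once remains the cleaner option, since the second pass cannot shrink already compressed data.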

string contain many '\0' after inflate

I am trying to decompress blocks of data that were compressed with zlib. The author noted that to decompress them I must use inflate_init and inflate with Z_SYNC_FLUSH. I am sure this must work, because it works in PHP this way:
$temp = substr($temp, 2, -4);
$temp{0} = chr(ord($temp{0}) | 1);
$temp = gzinflate($temp);
But I have tried many methods to decompress this in C++, and every one fails.
Here is one of them:
char compressedblockbuffer[3371];
char uncompressedblockbuffer[8192];
is.read(compressedblockbuffer, 3371);
z_stream strm;
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
strm.avail_in = 3371;
strm.next_in = (Bytef *)compressedblockbuffer;
strm.avail_out = 8192;
strm.next_out = (Bytef *)uncompressedblockbuffer;
inflateInit(&strm);
inflate(&strm, Z_SYNC_FLUSH);
inflateEnd(&strm);
This is not the full code, just an example to show the problem, which is why I hard-coded the already known sizes.
I am using the latest zlib release, so maybe something has changed in zlib's inflate since 2003-2004?
So the result is:
It seems that uncompressedblockbuffer contains '\0' at indexes 2, 3, 4 and many others, and if I print it to the console I see only the first two elements.
UPD:
If gzinflate() in PHP works on the data, then your code won't. gzinflate() expects raw deflate data. Your code is looking for zlib-wrapped deflate data. If you want to decode raw deflate data, you need to use inflateInit2(&strm, -15) instead.
Your call to inflate() is likely returning an error that you are not checking for. You need to always check the return codes of the zlib routines, or for that matter any function that has the potential to return an error.
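The same two points, raw deflate mode and checking for errors, can be sketched in Python, whose zlib module wraps the same library. Negative window bits select raw deflate just as in inflateInit2(&strm, -15); exceptions and the eof flag play the role of the C return codes. The function name inflate_raw is made up for the example.

```python
import zlib


def inflate_raw(data: bytes) -> bytes:
    """Inflate headerless (raw) deflate data and fail loudly on problems."""
    decoder = zlib.decompressobj(wbits=-15)   # raw deflate, no zlib header expected
    try:
        out = decoder.decompress(data)
    except zlib.error as exc:
        # Corresponds to inflate() returning Z_DATA_ERROR and similar codes.
        raise ValueError(f"inflate failed: {exc}") from exc
    if not decoder.eof:
        # The final block was never reached, so the input is incomplete.
        raise ValueError("deflate stream is truncated")
    return out
```

Feeding raw deflate data to the default zlib-wrapped decoder typically fails with an "incorrect header check" error, which is the kind of failure the unchecked inflate() call in the question hides. The result is a byte string with an explicit length, so zero bytes inside it are ordinary data.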
What kind of data are you decompressing? Many binary formats are perfectly accepting of NUL bytes in their data, since it just reads as a value of 0. For example, inside of image data in many formats, it'd just represent a value of 0 in either that channel or pixel (depending on data size). Not to mention, binary formats don't necessarily read as bytes. A NUL byte may actually be a part of a 2- or 4-byte value.
This is the problem with trying to read binary data as a character string. Binary data needn't follow the rules of text. This is why usually the data boundary is a separate size value, because it can't terminate on NUL values like text.
If you have the original uncompressed data for comparison, either load that data into memory and compare the data, or save the decompressed data to a file and use a diff tool to do a binary comparison of the files.
