I'm wondering why I need to cut off the last 4 characters after using gzcompress().
Here is my code:
header("Content-Encoding: gzip");
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00";
$index = $smarty->fetch("design/templates/main.htm") ."\n<!-- Compressed by gzip -->";
$this->content_size = strlen($index);
$this->content_crc = crc32($index);
$index = gzcompress($index, 9);
$index = substr($index, 0, strlen($index) - 4); // Why cut off ??
echo $index;
echo pack('V', $this->content_crc) . pack('V', $this->content_size);
When I don't cut off the last 4 chars, the source ends like:
[...]
<!-- Compressed by gzip -->N
When I cut them off it reads:
[...]
<!-- Compressed by gzip -->
I could see the additional N only in Chrome's code inspector (not in Firefox and not in IE's source). But there seem to be four additional characters at the end of the code.
Can anyone explain why I need to cut off these 4 chars?
gzcompress implements the ZLIB compressed data format that has the following structure:
0 1
+---+---+
|CMF|FLG| (more-->)
+---+---+
(if FLG.FDICT set)
0 1 2 3
+---+---+---+---+
| DICTID | (more-->)
+---+---+---+---+
+=====================+---+---+---+---+
|...compressed data...| ADLER32 |
+=====================+---+---+---+---+
Here you see that the last four bytes are an Adler-32 checksum.
In contrast to that, the GZIP file format is a list of so-called members with the following structure:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+
(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+
(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+
(if FLG.FHCRC set)
+---+---+
| CRC16 |
+---+---+
+=======================+
|...compressed blocks...| (more-->)
+=======================+
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| CRC32 | ISIZE |
+---+---+---+---+---+---+---+---+
As you can see, GZIP uses a CRC-32 checksum for the integrity check.
So to analyze your code:
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00"; – puts out the following header fields:
0x1f 0x8b – ID1 and ID2, identifiers to identify the data format (these are fixed values)
0x08 – CM, compression method that is used; 8 denotes the use of the DEFLATE data compression format (RFC 1951)
0x00 – FLG, flags
0x00000000 – MTIME, modification time
the fields XFL (extra flags) and OS (operating system) are not written explicitly; the first two bytes of the gzcompress output (the zlib header, typically 0x78 0x9c) end up in those two positions, and decoders treat both fields as purely informational
echo $index; – puts out the gzcompress output with its Adler-32 trailer cut off; apart from the leading 2-byte zlib header just mentioned, this is compressed data in the DEFLATE data compression format (RFC 1951)
echo pack('V', $this->content_crc) . pack('V', $this->content_size); – puts out the CRC-32 checksum and the size of the uncompressed input data in binary
gzcompress produces output in the format described in RFC 1950; the last 4 bytes you're chopping off are the Adler-32 checksum. This is the "deflate" encoding, so you could just set Content-Encoding: deflate and not manipulate anything.
If you want to use gzip, use gzencode(), which produces the gzip format.
Is there any way to determine if a string is base64-encoded twice?
For example, is there a regex pattern that I can use with preg_match to do this?
(Practical answer.) Don't use regex. Decode your string using base64_decode()'s optional $strict parameter set to true and see if it matches the format you expect. Or simply try and decode it as many times as it permits. E.g.:
function base64_decode_multiple(string $data, int $count = 2) {
    while ($count-- > 0 && ($decoded = base64_decode($data, true)) !== false) {
        $data = $decoded;
    }
    return $data;
}
(Theoretical answer.) Double-base64-encoded strings are regular, because there is a finite number of byte sequences that properly base64-encode a base64-encoded message.
You can check if something is base64-encoded once since you can validate each set of four characters. The last four bytes in a base64-encoded message may be a special case because =s are used as padding. Using the regular expression:
<char> := [A-Za-z0-9+/]
<end-char> := [A-Za-z0-9+/=]
<chunk> := <char>{4}
<end-chunk> := <char>{2} <end-char>{2} | <char>{3} <end-char>
<base64-encoded> := <chunk>* <end-chunk>?
You can also determine if something is base64-encoded twice using regular expressions, but the solution is not trivial or pretty, since it's not enough to check 4 bytes at a time.
Example: "QUFBQQ==" base64-decodes to "AAAA" that base64-decodes to three NUL-bytes:
$ echo -n "QUFBQQ==" | base64 -d | xxd
00000000: 4141 4141 AAAA
$ echo -n "AAAA" | base64 -d | xxd
00000000: 0000 00 ...
At this point we could enumerate all double-base64-encodings where the base64-encoding is 4 bytes within the base64 alphabet ("AAAA", "AAAB", "AAAC", "AAAD", etc.) and minimize this:
<ugly 4> := QUFBQQ== | QUFBQg== | QUFBQw== | QUFBRA== | ...
And we could enumerate the first 4 bytes of all double-base64-encodings where the base64-encoding is 8 bytes or longer (cases that don't involve padding with =) and minimize that:
<chunk 4> := QUFB | QkFB | Q0FB | REFB | ...
One partition (the pretty one) of double-base64-encoded strings will not contain =s at the end; their lengths are a multiple of 8:
<pretty double-base64-encoded> := <chunk 4>{2}*
Another partition of double-base64-encoded strings will have lengths that are multiples of 4 but not 8 (4, 12, 20, etc.); they can be thought of as pretty ones with an ugly bit at the end:
<ugly double-base64-encoded> := <chunk 4>{2}* <ugly 4>
We could then construct a combined regular expression:
<double-base64-encoded> := <pretty double-base64-encoded>
| <ugly double-base64-encoded>
As I said, you probably don't want to go through all this mess just because double-base64-encoded messages are regular. Just like you don't want to check if an integer is within some finite interval. Also, this is a good example of getting the wrong answer when you should have been asking another question. :-)
I'm trying to compress data with lz4_compress in PHP and decompress it with https://github.com/pierrec/lz4 in Go,
but it fails.
It seems that the lz4_compress output is missing the LZ4 frame header, and the block data is slightly different.
Please help me solve the problem.
<?php
echo base64_encode(lz4_compress("Hello World!"));
?>
output:
DAAAAMBIZWxsbyBXb3JsZCE=
package main

import (
    "bytes"
    "encoding/base64"
    "fmt"

    "github.com/pierrec/lz4"
)

func main() {
    a, _ := base64.StdEncoding.DecodeString("DAAAAMBIZWxsbyBXb3JsZCE=")
    fmt.Printf("%b\n", a)

    buf := new(bytes.Buffer)
    w := lz4.NewWriter(buf)
    b := bytes.NewReader([]byte("Hello World!"))
    w.ReadFrom(b)
    fmt.Printf("%b\n", buf.Bytes())
}
output:
[1100 0 0 0 11000000 1001000 1100101 1101100 1101100 1101111 100000 1010111 1101111 1110010 1101100 1100100 100001]
[100 100010 1001101 11000 1100100 1110000 10111001 1100 0 0 10000000 1001000 1100101 1101100 1101100 1101111 100000 1010111 1101111 1110010 1101100 1100100 100001]
lz4.h explicitly says
lz4.h provides block compression functions. It gives full buffer control to user.
Decompressing an lz4-compressed block also requires metadata (such as compressed size). Each application is free to encode such metadata in whichever way it wants.
An additional format, called LZ4 frame specification (doc/lz4_Frame_format.md),
take care of encoding standard metadata alongside LZ4-compressed blocks. If your application requires interoperability, it's recommended to use it. A library is provided to take care of it, see lz4frame.h.
The PHP extension doesn't do that; it produces bare compressed blocks.
http://lz4.github.io/lz4/ explicitly lists the PHP extension as not interoperable (in the "Customs LZ4 ports and bindings" section).
Sounds good! Now try
echo -n DAAAAMBIZWxsbyBXb3JsZCE= | base64 -d
I found that the first 4 bytes are 0C 00 00 00 (the length of the string) and the rest is Hello World!. So I think that when PHP realizes such a short input cannot be compressed, it writes the input as-is (try echo -n "Hello World!" | lz4c). The problem is that it gives you no way to recognize this, or am I wrong?
I make an HTTP POST request to a remote service which requires the post body to be "deflated" (and Content-Encoding: deflate should be sent in the headers). From my understanding, this is covered in RFC 1950. Which PHP function should I use to be compatible?
gzencode
gzdeflate
gzcompress
Content-Encoding: deflate requires data to be presented using the zlib structure (defined in RFC 1950), with the deflate compression algorithm (defined in RFC 1951).
Consider
<?php
$str = 'test';
$defl = gzdeflate($str);
echo bin2hex($defl), "\n";
$comp = gzcompress($str);
echo bin2hex($comp), "\n";
?>
This gives us:
2b492d2e0100
789c2b492d2e0100045d01c1
so the gzcompress result is the gzdeflate'd buffer preceded by 789c, which appears to be a valid zlib header
0111         | 1000      | 11100  | 0       | 10
CINFO        | CM        | FCHECK | FDICT   | FLEVEL
7=32K window | 8=deflate |        | no dict | 2=default algo
and followed by 4 bytes of checksum. This is what we're looking for.
To sum it up,
gzdeflate returns a raw deflated buffer (RFC 1951)
gzcompress returns a deflated buffer wrapped in zlib stuff (RFC 1950)
Content-Encoding: deflate requires a wrapped buffer, that is, use gzcompress when sending deflated data.
Note the confusing naming: gzdeflate is not for Content-Encoding: deflate and gzcompress is not for Content-Encoding: compress. Go figure!
I was able to install WebP support for ImageMagick. But I'm missing some precise commands.
The basics are covered by:
$im = new Imagick();
$im->pingImage($src);
$im->readImage($src);
$im->resizeImage($width,$height,Imagick::FILTER_CATROM , 1,TRUE );
$im->setImageFormat( "webp" );
$im->writeImage($dest);
But I'm missing lots of fine tuning options as described in the imageMagick command line documentation here:
http://www.imagemagick.org/script/webp.php
Specifically:
How do I set compression quality? (I tried setImageCompressionQuality and it does not work, ie the output is always the same size)
How do I set the "method" (from 0 to 6)?
Thanks
EDIT: I followed @emcconville's advice below (thanks!) and neither the method nor the compression worked. So I'm starting to suspect my build of ImageMagick.
I tried using command line:
convert photo.jpg -resize 1170x1170\> -quality 50 photo.webp
When changing the quality value of 50, the resulting file was always the same size. So there must be something wrong with my ImageMagick...
How do I set the "method" (from 0 to 6)?
Try this...
$im = new Imagick();
$im->pingImage($src);
$im->readImage($src);
$im->resizeImage($width,$height,Imagick::FILTER_CATROM , 1,TRUE );
$im->setImageFormat( "webp" );
$im->setOption('webp:method', '6');
$im->writeImage($dest);
How do I set compression quality? (I tried setImageCompressionQuality and it does not work, ie the output is always the same size)
Imagick::setImageCompressionQuality seems to work for me, but note that webp:lossless becomes enabled if the value is 100 or greater (see source). You can test toggling lossless to see how that impacts results.
$img->setImageFormat('webp');
$img->setImageCompressionQuality(50);
$img->setOption('webp:lossless', 'true');
Edit from comments
Try testing the image conversion to webp directly with the cwebp utility.
cwebp -q 50 photo.jpg -o photo.webp
This will also write some statistical information to stdout, which can help debug what's happening.
Saving file 'photo.webp'
File: photo.jpg
Dimension: 1170 x 1170
Output: 4562 bytes Y-U-V-All-PSNR 55.70 99.00 99.00 57.47 dB
block count: intra4: 91
intra16: 5385 (-> 98.34%)
skipped block: 5357 (97.83%)
bytes used: header: 86 (1.9%)
mode-partition: 2628 (57.6%)
Residuals bytes |segment 1|segment 2|segment 3|segment 4| total
macroblocks: | 0%| 0%| 0%| 98%| 5476
quantizer: | 45 | 45 | 43 | 33 |
filter level: | 14 | 63 | 8 | 5 |
Also remember that for some subject matter, adjusting the compression quality doesn't always guarantee a file size decrease. But those are extreme edge cases.
Firstly, my Java version:
String str = "helloworld";
ByteArrayOutputStream localByteArrayOutputStream = new ByteArrayOutputStream(str.length());
GZIPOutputStream localGZIPOutputStream = new GZIPOutputStream(localByteArrayOutputStream);
localGZIPOutputStream.write(str.getBytes("UTF-8"));
localGZIPOutputStream.close();
localByteArrayOutputStream.close();
for (int i = 0; i < localByteArrayOutputStream.toByteArray().length; i++) {
    System.out.println(localByteArrayOutputStream.toByteArray()[i]);
}
and output is:
31
-117
8
0
0
0
0
0
0
0
-53
72
-51
-55
-55
47
-49
47
-54
73
1
0
-83
32
-21
-7
10
0
0
0
Then the Go version:
var gzBf bytes.Buffer
gzSizeBf := bufio.NewWriterSize(&gzBf, len(str))
gz := gzip.NewWriter(gzSizeBf)
gz.Write([]byte(str))
gz.Flush()
gz.Close()
gzSizeBf.Flush()
GB := (&gzBf).Bytes()
for i := 0; i < len(GB); i++ {
    fmt.Println(GB[i])
}
output:
31
139
8
0
0
9
110
136
0
255
202
72
205
201
201
47
207
47
202
73
1
0
0
0
255
255
1
0
0
255
255
173
32
235
249
10
0
0
0
Why?
At first I thought it might be caused by the two languages reading bytes differently. But I noticed that 0 can never turn into 9, and the sizes of the []byte slices are different.
Have I written the code wrong? Is there any way to make my Go program produce the same output as the Java program?
Thanks!
First thing is that the byte type in Java is signed, it has a range of -128..127, while in Go byte is an alias of uint8 and has a range of 0..255. So if you want to compare the results, you have to shift negative Java values by 256 (add 256).
Tip: To display a Java byte value in an unsigned fashion, use byteValue & 0xff, which converts it to an int using the 8 bits of the byte as the lowest 8 bits of the int. Or better: display both results in hex form so you don't have to care about signedness...
Even if you do the shift, you will still see different results. That might be due to different default compression level in the different languages. Note that although the default compression level is 6 in both Java and Go, this is not specified and different implementations are allowed to choose different values, and it might also change in future releases.
And even if the compression level would be the same, you might still encounter differences because gzip is based on LZ77 and Huffman coding which uses a tree built on frequency (probability) to decide the output codes and if different input characters or bit patterns have the same frequency, assigned codes might vary between them, and moreover multiple output bit patterns might have the same length and therefore a different one might be chosen.
If you want the same output, the only way would be (see notes below!) to use compression level 0 (no compression at all). In Go use the compression level gzip.NoCompression and in Java use Deflater.NO_COMPRESSION.
Java:
GZIPOutputStream gzip = new GZIPOutputStream(localByteArrayOutputStream) {
    {
        def.setLevel(Deflater.NO_COMPRESSION);
    }
};
Go:
gz, err := gzip.NewWriterLevel(gzSizeBf, gzip.NoCompression)
But I wouldn't worry about the different outputs. Gzip is a standard, even if outputs are not the same, you will still be able to decompress the output with any gzip decoders whichever was used to compress the data, and the decoded data will be exactly the same.
Here are the simplified, extended versions:
Not that it matters, but your code is unnecessarily complex. You could simplify it like this (these versions also set compression level 0 and convert negative Java byte values):
Java version:
ByteArrayOutputStream buf = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(buf) {
    { def.setLevel(Deflater.NO_COMPRESSION); }
};
gz.write("helloworld".getBytes("UTF-8"));
gz.close();
for (byte b : buf.toByteArray())
    System.out.print((b & 0xff) + " ");
Go version:
var buf bytes.Buffer
gz, _ := gzip.NewWriterLevel(&buf, gzip.NoCompression)
gz.Write([]byte("helloworld"))
gz.Close()
fmt.Println(buf.Bytes())
NOTES:
The gzip format allows some extra fields (headers) to be included in the output.
In Go these are represented by the gzip.Header type:
type Header struct {
    Comment string    // comment
    Extra   []byte    // "extra data"
    ModTime time.Time // modification time
    Name    string    // file name
    OS      byte      // operating system type
}
And it is accessible via the Writer.Header struct field. Go sets and inserts them, while Java does not (leaves header fields zero). So even if you set compression level to 0 in both languages, the output will not be the same (but the "compressed" data will match in both outputs).
Unfortunately the standard Java does not provide a way/interface to set/add these fields, and Go does not make it optional to fill the Header fields in the output, so you will not be able to generate exact outputs.
An option would be to use a 3rd-party gzip library for Java which supports setting these fields. Apache Commons Compress is such an example: it contains a GzipCompressorOutputStream class with a constructor that accepts a GzipParameters instance. This GzipParameters is the equivalent of the gzip.Header structure. Only by using this would you be able to generate identical output.
But as mentioned, generating exact output has no real-life value.
From RFC 1952, the GZip file header is structured as:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
Looking at the output you've provided, we have:
                          | Java    | Go
ID1                       | 31      | 31
ID2                       | 139     | 139
CM (compression method)   | 8       | 8
FLG (flags)               | 0       | 0
MTIME (modification time) | 0 0 0 0 | 0 9 110 136
XFL (extra flags)         | 0       | 0
OS (operating system)     | 0       | 255
So we can see that Go is setting the modification time field of the header, and setting the operating system to 255 (unknown) rather than 0 (FAT file system). In other respects they indicate that the file is compressed in the same way.
In general these sorts of differences are harmless. If you want to determine if two compressed files are the same, then you should really compare the decompressed versions of the files though.