why golang and php gzip results are different [duplicate]

why golang and php gzip results are different [duplicate] - php

Firstly, my Java version:
string str = "helloworld";
ByteArrayOutputStream localByteArrayOutputStream = new ByteArrayOutputStream(str.length());
GZIPOutputStream localGZIPOutputStream = new GZIPOutputStream(localByteArrayOutputStream);
localGZIPOutputStream.write(str.getBytes("UTF-8"));
localGZIPOutputStream.close();
localByteArrayOutputStream.close();
for(int i = 0;i < localByteArrayOutputStream.toByteArray().length;i ++){
System.out.println(localByteArrayOutputStream.toByteArray()[i]);
}
and output is:
31
-117
8
0
0
0
0
0
0
0
-53
72
-51
-55
-55
47
-49
47
-54
73
1
0
-83
32
-21
-7
10
0
0
0
Then the Go version:
var gzBf bytes.Buffer
gzSizeBf := bufio.NewWriterSize(&gzBf, len(str))
gz := gzip.NewWriter(gzSizeBf)
gz.Write([]byte(str))
gz.Flush()
gz.Close()
gzSizeBf.Flush()
GB := (&gzBf).Bytes()
for i := 0; i < len(GB); i++ {
fmt.Println(GB[i])
}
output:
31
139
8
0
0
9
110
136
0
255
202
72
205
201
201
47
207
47
202
73
1
0
0
0
255
255
1
0
0
255
255
173
32
235
249
10
0
0
0
Why?
I thought it might be caused by different byte reading methods of those two languages at first. But I noticed that 0 can never convert to 9. And the sizes of []byte are different.
Have I written wrong code? Is there any way to make my Go program get the same output as the Java program?
Thanks!

First thing is that the byte type in Java is signed, it has a range of -128..127, while in Go byte is an alias of uint8 and has a range of 0..255. So if you want to compare the results, you have to shift negative Java values by 256 (add 256).
Tip: To display a Java byte value in an unsigned fashion, use: byteValue & 0xff which converts it to int using the 8 bits of the byte as the lowest 8 bits in the int. Or better: display both results in hex form so you don't have to care about sign-ness...
Even if you do the shift, you will still see different results. That might be due to different default compression level in the different languages. Note that although the default compression level is 6 in both Java and Go, this is not specified and different implementations are allowed to choose different values, and it might also change in future releases.
And even if the compression level would be the same, you might still encounter differences because gzip is based on LZ77 and Huffman coding which uses a tree built on frequency (probability) to decide the output codes and if different input characters or bit patterns have the same frequency, assigned codes might vary between them, and moreover multiple output bit patterns might have the same length and therefore a different one might be chosen.
If you want the same output, the only way would be (see notes below!) to use the 0 compression level (not to compress at all). In Go use the compression level gzip.NoCompression and in Java use the Deflater.NO_COPMRESSION.
Java:
GZIPOutputStream gzip = new GZIPOutputStream(localByteArrayOutputStream) {
{
def.setLevel(Deflater.NO_COMPRESSION);
}
};
Go:
gz, err := gzip.NewWriterLevel(gzSizeBf, gzip.NoCompression)
But I wouldn't worry about the different outputs. Gzip is a standard, even if outputs are not the same, you will still be able to decompress the output with any gzip decoders whichever was used to compress the data, and the decoded data will be exactly the same.
Here are the simplified, extended versions:
Not that it matters, but your codes are unneccessarily complex. You could simplify them like this (these versions also include setting 0 compression level and converting negative Java byte values):
Java version:
ByteArrayOutputStream buf = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(buf) {
{ def.setLevel(Deflater.NO_COMPRESSION); }
};
gz.write("helloworld".getBytes("UTF-8"));
gz.close();
for (byte b : buf.toByteArray())
System.out.print((b & 0xff) + " ");
Go version:
var buf bytes.Buffer
gz, _ := gzip.NewWriterLevel(&buf, gzip.NoCompression)
gz.Write([]byte("helloworld"))
gz.Close()
fmt.Println(buf.Bytes())
NOTES:
The gzip format allows some extra fields (headers) to be included in the output.
In Go these are represented by the gzip.Header type:
type Header struct {
Comment string // comment
Extra []byte // "extra data"
ModTime time.Time // modification time
Name string // file name
OS byte // operating system type
}
And it is accessible via the Writer.Header struct field. Go sets and inserts them, while Java does not (leaves header fields zero). So even if you set compression level to 0 in both languages, the output will not be the same (but the "compressed" data will match in both outputs).
Unfortunately the standard Java does not provide a way/interface to set/add these fields, and Go does not make it optional to fill the Header fields in the output, so you will not be able to generate exact outputs.
An option would be to use a 3rd party GZip library for Java which supports setting these fields. Apache Commons Compress is such an example, it contains a GzipCompressorOutputStream class which has a constructor which allows a GzipParameters instance to be passed. This GzipParameters is the equvivalent of the gzip.Header structure. Only using this would you be able to generate exact output.
But as mentioned, generating exact output has no real-life value.

From RFC 1952, the GZip file header is structured as:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
Looking at the output you've provided, we have:
| Java | Go
ID1 | 31 | 31
ID2 | 139 | 139
CM (compression method) | 8 | 8
FLG (flags) | 0 | 0
MTIME (modification time) | 0 0 0 0 | 0 9 110 136
XFL (extra flags) | 0 | 0
OS (operating system) | 0 | 255
So we can see that Go is setting the modification time field of the header, and setting the operating system to 255 (unknown) rather than 0 (FAT file system). In other respects they indicate that the file is compressed in the same way.
In general these sorts of differences are harmless. If you want to determine if two compressed files are the same, then you should really compare the decompressed versions of the files though.

Related

What is the encoder used by PHP $POST command for audio blob?

The first sequence of audio samples is recorded in JavaScript from the browser in mono, PCM format, 16-bit, and 96,000 Hz. I wrote this audio file as a blob to a server through a JavaScript FormData object using ajax.
These are the raw audio samples. When I retrieved the audio from the web server's directory listing, I received the second sequence of audio samples. It has been downsampled to 48000 Hz and the samples have been altered. What encoder is being used?
Server-side PHP code:
$input = $_FILES['audio']['tmp_name']; //audio blob
$output = $_POST['filename'];
if(move_uploaded_file($input, $output))
exit('Audio file Uploaded');
Client-side JavaScript code:
function send_audio(fn, blob){
var formData = new FormData();
formData.append("filename", fn);
formData.append('audio', blob);
$.ajax({
url:'save_audio.php',
type:'post',
data: formData,
contentType:false,
processData:false,
cache:false,
success: function(data){
console.log("send_audio success!");
console.log(data);
}
});
}

It's not clear whats going on but consider the following:
(1) Try 16-bits not 32-bits per sample (might help).
There is no need to record these audio samples at 32 bits per sample. That means you're using 4 bytes to record one sample value that fits within 2 bytes (16 bits).
Your highest value is 30465 (bytes x77 x01) and the lowest value is -32513 (bytes x80 xFF).
Notice those values fit two bytes? By using 4 bytes you make x00 x00 x77 x01. A waste to store extra00.
Two bytes hold max value of 65535. Or even 32767 if you divide into two parts as +32767 and -32767.
(2) Make sure there is no confusion between encoding of bytes (as hex), and digits (as integers).
Bytes are written as hex inside a file. Your values seem to mess up when converting from integers into hex format. Also 16 bit vs 32 bits for the
Some examples from your numbers compared to 2 bytes (16-bit) hex:
FIRST SEQ: SECOND SEQ:
value | hex | value | hex
100 00 64 612 02 64
553 02 29 41 00 29
203 00 CB -309 FE CB
-409 FE 67 103 00 67
68 00 44 836 03 44
953 03 B9 185 00 B9
154 00 9A -102 FF 9A
-127 FF 81 385 01 81
Notice the value of integer 100 when written as hex (byte values) it becomes byte x64, for some reason a 2 is added in front of that 64 now making it x0264 which converts to integer value of 612.
In the second sequence, where are these extra 02 and FE etc coming from? Why if hex of first seq has a 00, then for the second seq there's something extra now added? Even vice-versa (first = something, then second = something removed to become as 00)?
You need to double check your recording code. Does it add some increasing number to the recorded value?? Because, as you know, 2 is followed by 3 and that pattern is there in your second sequence. Also true that FE is followed by FF (and therefore previous to FE would be FD, FC, FB, FA, etc).
If you still can't explain it then:
Record at 16-bit with 44,100 Hz like the normal way others do it. Then later get complicated if the simple setup works okay. Why 32-bit and 96,000 Hz anyway? Is it a scientific test?
Use a Hex Editor to view your bytes from both sides (saved blob file via JS, against file downloaded via PHP) and compare any changes between the two files.
Show how/what your recording code is doing.
Share the bytes of "I retrieved the audio from the web server" (save as file) for checking.
Maybe someone else can help you. I don't use html forms so maybe there's an encoding option for the POST data type in there too?

problems with lz4 between php and golang

I try to compress data with lz4_compress in php and uncompress data with https://github.com/pierrec/lz4 in golang
but it fails.
it seems that the lz4_compress output misses the lz4 header, and the block data is little different.
please help me solve the problem.
<?php
echo base64_encode(lz4_compress("Hello World!"));
?>
output:
DAAAAMBIZWxsbyBXb3JsZCE=
package main
import (
"bytes"
"encoding/base64"
"fmt"
"github.com/pierrec/lz4"
)
func main() {
a, _ := base64.StdEncoding.DecodeString("DAAAAMBIZWxsbyBXb3JsZCE=")
fmt.Printf("%b\n", a)
buf := new(bytes.Buffer)
w := lz4.NewWriter(buf)
b := bytes.NewReader([]byte("Hello World!"))
w.ReadFrom(b)
fmt.Printf("%b\n", buf.Bytes())
}
output:
[1100 0 0 0 11000000 1001000 1100101 1101100 1101100 1101111 100000 1010111 1101111 1110010 1101100 1100100 100001]
[100 100010 1001101 11000 1100100 1110000 10111001 1100 0 0 10000000 1001000 1100101 1101100 1101100 1101111 100000 1010111 1101111 1110010 1101100 1100100 100001]

lz4.h explicitly says
lz4.h provides block compression functions. It gives full buffer control to user.
Decompressing an lz4-compressed block also requires metadata (such as compressed size). Each application is free to encode such metadata in whichever way it wants.
An additional format, called LZ4 frame specification (doc/lz4_Frame_format.md),
take care of encoding standard metadata alongside LZ4-compressed blocks. If your application requires interoperability, it's recommended to use it. A library is provided to take care of it, see lz4frame.h.
The PHP extension doesn't do that; it produces bare compressed blocks.
http://lz4.github.io/lz4/ explicitly lists the PHP extension as not interoperable (in the "Customs LZ4 ports and bindings" section).

Sound good! And now try
echo -n DAAAAMBIZWxsbyBXb3JsZCE= | base64 -d
I got in first 4 bytes is written 0C 00 00 00 - that is the lenght of string and rest is Hello World!. Therefore I think, that if php realize that compression of such a short input is not possible it writes the input (try echo -n "Hello World!" | lz4c ). But problem is it does not allow you recognize such a thing, or I'm wrong?

Parsing GIF application's extension blocks- how to find block size?

I am parsing a GIF 89a (yes, I need to) file and I am stuck on Application Extension blocks.
They have 13 byte header (including the beginning 21 FF 0B bytes) and then there is some data. How much data is there? How do I know know much to read?
You can skip the section below if you know the answer and just tell me :)
This page says:
ApplicationData contains the information that is used by the software application. This field is structured in a series of sub-blocks identical to the data found in a Plain Text Extension block."
Each sub-block begins with a byte that indicates the number of data bytes that follow. From 1 to 255 data bytes may follow this byte. There may be any number of sub-blocks in this field.
This way I can parse NETSCAPE 2.0 blocks which are:
03 01 00 00 00
so I have a loop in PHP:
for (;;)
{
$size = ord(fread($handle, 1));
if ($size == 0) break;
fseek($handle, $size);
}
or the same in Delphi, if you prefer:
while F.Position < F.Size do begin
F.Read(Size, 1); // F is TFileStream
if Size = 0 then break;
F.Position := F.Position + Size;
end;
The iteration goes:
size = read 1 byte; //size = 3;
read 3 byte;
size = read 1 byte;
size = 0 so break
So far, so good, here comes the problem: the XMP Data
So the bytes in this block go like this (ASCII below):
21 FF 0B 58 4D 50 20 44 61 74 61 58 4D 50
!`.XMP DataXMP
and then goes ASCII XML dump:
<?xpacket begin="ď»ż" id="W5M0MpCehiHzreSzNTczkc9d"?>
for about 500 bytes.
I obviously can't read it the same way I read NETSCAPE 2.0 blocks.
It seems to be terminated with 00 byte.
Should it just always read until 00 byte? Then if would fail on NETSCAPE 2.0 blocks!
How should a GIF decoder behave on Application Extension blocks? How much data is in them?
Problematic XMP Data image

Ok- the NETSCAPE 2.0 block approach might be fine and it was failing on the XML because my file could be corruptly read.

Convert HEX to ASCII, data from GPS tracker

I have just bought a GPS Tracker, it can send SMS to cellphone just fine. It also supports reporting to a server via GPRS.
I have setup the device to contact my own server on port 8123, it's a FreeBSD server and i have checked that i recieve packets on that port.
I successfully have setup a listener server written in PHP, and i can receive data from the device. But how do i convert the partial HEX data to something usefull (ASCII)?
Example data string:
$$^#T^#E Y'^WÿU210104.000,A,5534.4079,N,01146.2510,E,0.00,,170411,,*10|1.0|72|0000á
Unfortunately i don't know how i can copy-paste the HEX parts
Now how do i get the ID part out? I have tried echo hexdec(mb_substr($data, 4, 7));
The data is following this protocol
From the document:
Command format of GPRS packets are as follows:
From server to tracker:
##\r\n
From tracker to server:
$$\r\n
Note:
Do NOT input ‘’ when writing a command.
All multi-byte data complies with the following sequence: High byte prior to low byte.
The size of a GPRS packet (including data) is about 100 bytes
Item Specification
## 2 bytes. It means the header of packet from server to tracker.
It is in ASCII code (Hex code: 0x40)
$$ 2 bytes. It is the header of packet from tracker to server.
It is in ASCII code (Hex code: 0x24)
L 2 bytes. It means the length of the whole packet including
the header and ending character and it is in hex code
ID 7 bytes, ID must be digit and not over 14 digits, the unused byte
will be stuffed by ‘f’ or ‘0xff’. It is in the format of hex code.
For example, if ID is 13612345678, then it will be shown as
follows: 0x13, 0x61, 0x23, 0x45, 0x67, 0x8f, 0xff.
If all 7 bytes are 0xff, it is a broadcasting command. ID is in hex code
command 2 bytes. The command code is in hex code. Please refer to the
command list below.
data Min 0 byte and max 100 bytes. See Annex 1 for description of ‘data’.
checksum 2 bytes. It indicates CRC-CCITT (default is 0xffff) checksum of
all data (not including CRC itself and the ending character).
It is in hex code.
For example: 24 24 00 11 13 61 23 45 67 8f ff 50 00 05 d8 0d 0a
0x05d8 = CRC-CCITT (24 24 00 11 13 61 23 45 67 8f ff 50 00)
\r\n 2 bytes. It is the ending character and in hex code
(0x0d,0x0a in hex code)
Update
With the answer from Anomie, i was able to piece this together
$arr = unpack('H4length/H14id/H4cmd/H4crc/H4end', mb_substr($data, 2, 11) . mb_substr($data, -4));
var_dump($arr);
This will out put something like
array(5) {
["length"]=>
string(4) "0054"
["id"]=>
string(14) "004512345678ff"
["cmd"]=>
string(4) "9955"
["crc"]=>
string(4) "c97e"
["end"]=>
string(4) "0d0a"
}

It sounds like you are needing to convert binary data to integers or strings. The most straightforward way is to use unpack.
For example, to extract the length you already know you can use
$length_bin = substr($string, 2, 2);
To convert it to an integer, you can use something like
$length = unpack('v', $length_bin); $length = $length[1];
The 'v' code will work for the length and the checksum; if you have a number stored as 4 bytes use 'V', and for the ID you can use 'H*' to get it as a string of hex digits. Other codes are listed in the documentation.
A somewhat less straightforward way is to do the bit manipulation manually, after using unpack with C* to get an array of all the byte values. For example,
$bytes = unpack('C*', $length_bin);
$length = ($bytes[0] << 8) | $bytes[1];

You need to know the format of the messages you are going to receive from the device. You can get this info from the manufacturer. Then, depending on that, you have to create a proper listener in the server side.
I've been working with several devices like that and normally you have to create a process in the server listening to the port with a Socket (or similar). You may have an authentication process also to differentiate between devices (you can have more than one). After that, you simply get the data from the device, you parse it and you store it. Depending on the device you can also send requests or configurations.
Hope this helps

*Edit 26 April:* I have changed the question a bit, thus this seems out of place. Initial question was more on how to read the data from TCP.
I found some great articles on writing a TCP/socket server in PHP (/me slaps PHP around a bit with a large trout)
http://devzone.zend.com/article/1086
http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
Can't wait to get this going :)

How does gzcompress work?

I'm wondering about why I need to cut off the last 4 Characters, after using gzcompress().
Here is my code:
header("Content-Encoding: gzip");
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00";
$index = $smarty->fetch("design/templates/main.htm") ."\n<!-- Compressed by gzip -->";
$this->content_size = strlen($index);
$this->content_crc = crc32($index);
$index = gzcompress($index, 9);
$index = substr($index, 0, strlen($index) - 4); // Why cut off ??
echo $index;
echo pack('V', $this->content_crc) . pack('V', $this->content_size);
When I don't cut of the last 4 chars, the source ends like:
[...]
<!-- Compressed by gzip -->N
When I cut them off it reads:
[...]
<!-- Compressed by gzip -->
I could see the additional N only in Chromes Code inspector (not in Firefox and not in IEs source). But there seams to be four additional characters at the end of the code.
Can anyone explain me, why I need to cut off 4 chars?

gzcompress implements the ZLIB compressed data format that has the following structure:
0 1
+---+---+
|CMF|FLG| (more-->)
+---+---+
(if FLG.FDICT set)
0 1 2 3
+---+---+---+---+
| DICTID | (more-->)
+---+---+---+---+
+=====================+---+---+---+---+
|...compressed data...| ADLER32 |
+=====================+---+---+---+---+
Here you see that the last four bytes is a Adler-32 checksum.
In contrast to that, the GZIP file format is a list of of so called members with the following structure:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+
(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+
(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+
(if FLG.FHCRC set)
+---+---+
| CRC16 |
+---+---+
+=======================+
|...compressed blocks...| (more-->)
+=======================+
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| CRC32 | ISIZE |
+---+---+---+---+---+---+---+---+
As you can see, GZIP uses a CRC-32 checksum for the integrity check.
So to analyze your code:
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00"; – puts out the following header fields:
0x1f 0x8b – ID1 and ID2, identifiers to identify the data format (these are fixed values)
0x08 – CM, compression method that is used; 8 denotes the use of the DEFLATE data compression format (RFC 1951)
0x00 – FLG, flags
0x00000000 – MTIME, modification time
the fields XFL (extra flags) and OS (operation system) are set by the DEFLATE data compression format
echo $index; – puts out compressed data according to the DEFLATE data compression format
echo pack('V', $this->content_crc) . pack('V', $this->content_size); – puts out the CRC-32 checksum and the size of the uncompressed input data in binary

gzcompress produces output described here RFC1950 , the last 4 bytes you're chopping off is the adler32 checksum. This is the "deflate" encoding, so you should just set "Content-Encoding: deflate" and not manipulate anything.
If you want to use gzip, use gzencode() , which uses the gzip format.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.