Is there any way to determine if a string is base64-encoded twice?
For example, is there a regex pattern that I can use with preg_match to do this?
(Practical answer.) Don't use regex. Decode your string with base64_decode()'s optional $strict parameter set to true and check whether the result matches the format you expect. Or simply try to decode it as many times as it succeeds, e.g.:
// Decode $data up to $count times, stopping as soon as strict decoding fails.
function base64_decode_multiple(string $data, int $count = 2) {
    while ($count-- > 0 && ($decoded = base64_decode($data, true)) !== false) {
        $data = $decoded;
    }
    return $data;
}
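For instance (using a string that, as the example further down shows, is three NUL bytes base64-encoded twice):
echo bin2hex(base64_decode_multiple('QUFBQQ==')); // prints 000000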
(Theoretical answer.) Double-base64-encoded strings form a regular language, because there is only a finite number of byte sequences that properly base64-encode a base64-encoded message.
You can check whether something is base64-encoded once, since you can validate each group of four characters. The last four bytes of a base64-encoded message are a special case because =s are used as padding. As a grammar:
<char> := [A-Za-z0-9+/]
<end-char> := [A-Za-z0-9+/=]
<chunk> := <char>{4}
<end-chunk> := <char>{2} <end-char>{2} | <char>{3} <end-char>
<base64-encoded> := <chunk>* <end-chunk>?
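As a sketch, that grammar translates to a PCRE you could use with preg_match (here with the end chunk tightened to the usual ==/= padding forms, which is stricter than <end-char> above):
$base64 = '/\A(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?\z/';
var_dump(preg_match($base64, 'QUFBQQ==') === 1);   // bool(true)
var_dump(preg_match($base64, 'not base64') === 1); // bool(false)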
You can also determine whether something is base64-encoded twice using regular expressions, but the solution is neither trivial nor pretty, since it's not enough to check 4 bytes at a time.
Example: "QUFBQQ==" base64-decodes to "AAAA" that base64-decodes to three NUL-bytes:
$ echo -n "QUFBQQ==" | base64 -d | xxd
00000000: 4141 4141 AAAA
$ echo -n "AAAA" | base64 -d | xxd
00000000: 0000 00 ...
At this point we could enumerate all double-base64-encodings where the inner base64 encoding is 4 bytes within the base64 alphabet ("AAAA", "AAAB", "AAAC", "AAAD", etc.) and minimize this:
<ugly 4> := QUFBQQ== | QUFBQg== | QUFBQw== | QUFBRA== | ...
And we could enumerate the first 4 bytes of all double-base64-encodings where the inner base64 encoding is 8 bytes or longer (cases that don't involve padding with =) and minimize that:
<chunk 4> := QUFB | QkFB | Q0FB | REFB | ...
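To see where these alternatives come from, here is a small sketch that reproduces them (it varies only the first inner character, padding with "AA", exactly as in the enumeration above; the full set would vary all three inner characters):
$alphabet = array_merge(range('A', 'Z'), range('a', 'z'), range('0', '9'), ['+', '/']);
$chunk4 = [];
foreach ($alphabet as $c) {
    $chunk4[] = substr(base64_encode($c . 'AA'), 0, 4);
}
echo implode(' | ', array_slice($chunk4, 0, 4)); // QUFB | QkFB | Q0FB | REFB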
One partition (the pretty one) of double-base64-encoded strings will not contain =s at the end. For the outer encoding to need no padding, the inner encoding's length must be a multiple of both 4 (it is itself base64) and 3 (to fill whole output groups), i.e. a multiple of 12, so the lengths of these strings are multiples of 16:
<pretty double-base64-encoded> := <chunk 4>{4}*
Another partition of double-base64-encoded strings will have lengths of the form 16m + 8; they can be thought of as pretty ones with an ugly bit at the end (the ugly bit encodes the inner encoding's last 4 bytes, hence the == padding):
<ugly double-base64-encoded> := <chunk 4>{4}* <ugly 4>
(A third partition, with lengths of the form 16m + 12 and a single trailing =, would be enumerated the same way; strictly speaking the set of valid 4-byte groups also varies with position, which makes the real enumeration even messier.) We could then construct a combined regular expression:
<double-base64-encoded> := <pretty double-base64-encoded>
                         | <ugly double-base64-encoded>
                         | ...
As I said, you probably don't want to go through all this mess just because double-base64-encoded messages happen to form a regular language, just like you wouldn't use a regex to check whether an integer lies within some finite interval. This is also a good example of getting the wrong answer because you asked the wrong question. :-)
Related
I am trying to compress data with lz4_compress in PHP and decompress it with https://github.com/pierrec/lz4 in Go,
but it fails.
It seems that the lz4_compress output is missing the LZ4 frame header, and the block data is slightly different.
Please help me solve the problem.
<?php
echo base64_encode(lz4_compress("Hello World!"));
?>
output:
DAAAAMBIZWxsbyBXb3JsZCE=
package main

import (
    "bytes"
    "encoding/base64"
    "fmt"

    "github.com/pierrec/lz4"
)

func main() {
    a, _ := base64.StdEncoding.DecodeString("DAAAAMBIZWxsbyBXb3JsZCE=")
    fmt.Printf("%b\n", a)
    buf := new(bytes.Buffer)
    w := lz4.NewWriter(buf)
    b := bytes.NewReader([]byte("Hello World!"))
    w.ReadFrom(b)
    fmt.Printf("%b\n", buf.Bytes())
}
output:
[1100 0 0 0 11000000 1001000 1100101 1101100 1101100 1101111 100000 1010111 1101111 1110010 1101100 1100100 100001]
[100 100010 1001101 11000 1100100 1110000 10111001 1100 0 0 10000000 1001000 1100101 1101100 1101100 1101111 100000 1010111 1101111 1110010 1101100 1100100 100001]
lz4.h explicitly says
lz4.h provides block compression functions. It gives full buffer control to user.
Decompressing an lz4-compressed block also requires metadata (such as compressed size). Each application is free to encode such metadata in whichever way it wants.
An additional format, called LZ4 frame specification (doc/lz4_Frame_format.md),
take care of encoding standard metadata alongside LZ4-compressed blocks. If your application requires interoperability, it's recommended to use it. A library is provided to take care of it, see lz4frame.h.
The PHP extension doesn't do that; it produces bare compressed blocks.
http://lz4.github.io/lz4/ explicitly lists the PHP extension as not interoperable (in the "Customs LZ4 ports and bindings" section).
Sounds good! And now try
echo -n DAAAAMBIZWxsbyBXb3JsZCE= | base64 -d
The first 4 bytes are 0C 00 00 00, which is the length of the string; the rest (one LZ4 token byte, then the literals) is Hello World!. Therefore I think that if PHP realizes that such a short input cannot be compressed, it writes the input as-is (try echo -n "Hello World!" | lz4c). But the problem is that it doesn't let you recognize such a case, or am I wrong?
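For reference, here is a sketch of reading that header from PHP; the 4-byte little-endian length prefix is inferred from the output above, not from any official documentation:
$raw = base64_decode('DAAAAMBIZWxsbyBXb3JsZCE=');
$len = unpack('V', substr($raw, 0, 4))[1];
echo $len;                // 12, the length of "Hello World!"
$block = substr($raw, 4); // bare LZ4 block data, no frame magic/header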
I have been scouring the ole interweb for this solution but have not found anything successful. I have a CSV output from one script that presents data in a specific way, and I need to match that against another file and merge the two. Added bonus if I can round the values to two decimal places.
File 1: dataset1.csv (using column 1 as a primary key, i.e. what I want to search the other file for.)
5033db62b38f86605f0baeccae5e6cbc,20.875,20.625,41.5
5033d9951846c1841437b437f5a97f0a,3.3529411764705882,12.4117647058823529,13.7647058823529412
50335ab3ab5411f88b77900736338bc6,6.625,1.0625,3
5033db62b38f86605f0baeccae5e6cbc,2.9375,1,1.4375
File 2: dataset2.csv (if column 2 of file 2 matches column 1 of file 1, join column 1 from file 2, replacing the data in column 1 of file 1.)
"dc2","5033db62b38f86605f0baeccae5e6cbc"
"dc1","5033d9951846c1841437b437f5a97f0a"
Desired results:
File 1 (or new file3):
dc1,3.35,12.41,13.76
dc2,20.875,20.625,41.5
Just to demonstrate that I have been trying to find a way, and not just randomly asking a question hoping someone else would solve my problem.
I have found a number of resources that say to use join.
join -o 1.1,1.2,1.3,1.4,2.3 file1 file2, etc. I have tested this a number of different ways. I read in a number of posts that the input needs to be sorted - with strings that long, that's a little hard. Not to mention file 1 may have 30 to 40 entries while file 2 may only have 10. I just need a name associated with the long string.
I started looking at grep - but then I would need a foreach loop to cycle through all the results, and there has to be an easier way.
I have also looked at awk - now this is a fun one, trying to figure out exactly how to make it work.
awk 'FNR==NR {a[$2]; next} $2 in a' file.csv testfile2.csv
Yeah.... I have tried many ways to get this to compare, as this seems to be the general idea... but still haven't got it to work. I would like this to be some type of shell script for Linux: very simple and something I can call from a PHP page and have it run. Like if a user hits refresh, it churns through and digests the data.
Any help would be greatly appreciated!
Thank you.
j.
You can use a combination of sort and GNU awk:
mergef.awk:
BEGIN { FS= "[ ,\"]+"; }
FNR == NR { if ( !($1 in vals) ) vals [ $1 ] = sprintf("%.2f,%.2f,%.2f", $2, $3,$4) ;}
FNR != NR { print $2 "," vals[ $3 ]; }
Say your files are f1.csv and f2.csv then use this command:
awk -f mergef.awk f1.csv f2.csv | sort
The first line in the script deals with the quotes present in the second file (because of this setting there is an empty field $1 for each line of the second file).
The second line reads in the first file. The if takes care that only the first occurrence of a key is used.
The last line prints the new keys from the second file along with the stored values from the first file, retrieved via the old keys.
FNR == NR is true only for the first file.
Using python and the pandas library:
import pandas as pd
# Read in the csv files.
df1 = pd.read_csv('dataset1.csv', header=None, index_col=0)
df2 = pd.read_csv('dataset2.csv', header=None, index_col=1)
# Round values in the first file to two decimal places.
df1 = df1.round(2)
# Merge the two files.
df3 = pd.merge(df2, df1, how='inner', left_index=True, right_index=True)
# Write the output.
df3.to_csv('output.csv', index=False, header=False)
Except for formatting the numbers, this does the job:
$ join -t, -1 1 -2 2 -o2.1,1.2,1.3,1.4 <(sort file1) <(tr -d '"' <file2 | sort -t, -k2)
dc1,3.3529411764705882,12.4117647058823529,13.7647058823529412
dc2,2.9375,1,1.4375
dc2,20.875,20.625,41.5
Note that there are two matches for dc2.
Bonus: for required formatting pipe the output of the previous script to
$ ... | tr ',' ' ' | xargs printf "%s,%.2f,%.2f,%.2f\n"
dc1,3.35,12.41,13.76
dc2,2.94,1.00,1.44
dc2,20.88,20.62,41.50
but then, perhaps awk is a better alternative. This is just to show that no programming is required if you can utilize the existing Unix toolset.
Here is a solution with PHP:
foreach (file("dataset1.csv") as $line_no => $csv) {
if (!$line_no) continue; // in case you have a header on first line
$fields = str_getcsv($csv);
$key = array_shift($fields);
$data1[$key] = array_map(function ($v) { return number_format($v, 2); }, $fields);
};
foreach (file("dataset2.csv") as $csv) {
$fields = str_getcsv($csv);
if (!isset($data1[$fields[1]])) continue;
$data2[$fields[0]] = array_merge(array($fields[0]), $data1[$fields[1]]);
};
ksort($data2);
$csv = implode("\n", array_map(function ($v) {
return implode(',', $v);
}, $data2));
file_put_contents("dataset3.csv", $csv);
NB: As you mentioned that the first file uses column 1 as a primary key, a duplicate key value should not occur. If it does, the last occurrence prevails.
Firstly, my Java version:
string str = "helloworld";
ByteArrayOutputStream localByteArrayOutputStream = new ByteArrayOutputStream(str.length());
GZIPOutputStream localGZIPOutputStream = new GZIPOutputStream(localByteArrayOutputStream);
localGZIPOutputStream.write(str.getBytes("UTF-8"));
localGZIPOutputStream.close();
localByteArrayOutputStream.close();
for(int i = 0;i < localByteArrayOutputStream.toByteArray().length;i ++){
System.out.println(localByteArrayOutputStream.toByteArray()[i]);
}
and output is:
31
-117
8
0
0
0
0
0
0
0
-53
72
-51
-55
-55
47
-49
47
-54
73
1
0
-83
32
-21
-7
10
0
0
0
Then the Go version:
var gzBf bytes.Buffer
gzSizeBf := bufio.NewWriterSize(&gzBf, len(str))
gz := gzip.NewWriter(gzSizeBf)
gz.Write([]byte(str))
gz.Flush()
gz.Close()
gzSizeBf.Flush()
GB := (&gzBf).Bytes()
for i := 0; i < len(GB); i++ {
    fmt.Println(GB[i])
}
output:
31
139
8
0
0
9
110
136
0
255
202
72
205
201
201
47
207
47
202
73
1
0
0
0
255
255
1
0
0
255
255
173
32
235
249
10
0
0
0
Why?
At first I thought it might be caused by the two languages' different byte-reading methods. But then I noticed that 0 can never convert to 9. And the sizes of the []byte slices are different.
Have I written wrong code? Is there any way to make my Go program get the same output as the Java program?
Thanks!
The first thing is that the byte type in Java is signed: it has a range of -128..127, while in Go byte is an alias for uint8 and has a range of 0..255. So if you want to compare the results, you have to shift negative Java values by 256 (add 256).
Tip: To display a Java byte value in an unsigned fashion, use byteValue & 0xff, which converts it to int using the 8 bits of the byte as the lowest 8 bits of the int. Or better: display both results in hex form so you don't have to care about signedness...
Even if you do the shift, you will still see different results. That might be due to different default compression levels in the two languages. Note that although the default compression level is 6 in both Java and Go, this is not specified: different implementations are allowed to choose different values, and it might also change in future releases.
And even if the compression level were the same, you might still encounter differences, because gzip is based on LZ77 and Huffman coding, which uses a tree built on frequency (probability) to decide the output codes. If different input characters or bit patterns have the same frequency, their assigned codes might vary; moreover, multiple output bit patterns might have the same length, and a different one might be chosen.
If you want the same output, the only way (but see the notes below!) is to use compression level 0 (no compression at all). In Go use the level gzip.NoCompression and in Java use Deflater.NO_COMPRESSION.
Java:
GZIPOutputStream gzip = new GZIPOutputStream(localByteArrayOutputStream) {
    {
        def.setLevel(Deflater.NO_COMPRESSION);
    }
};
Go:
gz, err := gzip.NewWriterLevel(gzSizeBf, gzip.NoCompression)
But I wouldn't worry about the different outputs. Gzip is a standard: even if the outputs are not identical, you will still be able to decompress either of them with any gzip decoder, whichever implementation compressed the data, and the decoded data will be exactly the same.
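You can check this from PHP, for instance: the byte sequence from the Java output above (negative values shifted by 256 and written as hex) is a perfectly ordinary gzip stream:
echo gzdecode(hex2bin('1f8b0800000000000000cb48cdc9c92fcf2fca490100ad20ebf90a000000'));
// helloworld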
Here are the simplified, extended versions:
Not that it matters, but your code is unnecessarily complex. You could simplify it like this (these versions also set compression level 0 and print Java byte values as unsigned):
Java version:
ByteArrayOutputStream buf = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(buf) {
    { def.setLevel(Deflater.NO_COMPRESSION); }
};
gz.write("helloworld".getBytes("UTF-8"));
gz.close();
for (byte b : buf.toByteArray())
    System.out.print((b & 0xff) + " ");
Go version:
var buf bytes.Buffer
gz, _ := gzip.NewWriterLevel(&buf, gzip.NoCompression)
gz.Write([]byte("helloworld"))
gz.Close()
fmt.Println(buf.Bytes())
NOTES:
The gzip format allows some extra fields (headers) to be included in the output.
In Go these are represented by the gzip.Header type:
type Header struct {
    Comment string    // comment
    Extra   []byte    // "extra data"
    ModTime time.Time // modification time
    Name    string    // file name
    OS      byte      // operating system type
}
And it is accessible via the Writer.Header struct field. Go sets and inserts them, while Java does not (leaves header fields zero). So even if you set compression level to 0 in both languages, the output will not be the same (but the "compressed" data will match in both outputs).
Unfortunately the standard Java library does not provide a way to set these fields, and Go does not make filling the Header fields in the output optional, so you will not be able to generate exactly matching outputs.
An option would be to use a 3rd party GZip library for Java which supports setting these fields. Apache Commons Compress is such an example: it contains a GzipCompressorOutputStream class which has a constructor that allows a GzipParameters instance to be passed. This GzipParameters is the equivalent of the gzip.Header structure. Only by using this would you be able to generate exact outputs.
But as mentioned, generating exact output has no real-life value.
From RFC 1952, the GZip file header is structured as:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
Looking at the output you've provided, we have:
                          | Java    | Go
ID1                       | 31      | 31
ID2                       | 139     | 139
CM (compression method)   | 8       | 8
FLG (flags)               | 0       | 0
MTIME (modification time) | 0 0 0 0 | 0 9 110 136
XFL (extra flags)         | 0       | 0
OS (operating system)     | 0       | 255
So we can see that Go is setting the modification time field of the header, and setting the operating system to 255 (unknown) rather than 0 (FAT file system). In other respects they indicate that the file is compressed in the same way.
In general these sorts of differences are harmless. If you want to determine if two compressed files are the same, then you should really compare the decompressed versions of the files though.
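If you want to inspect those header fields yourself, here is a sketch in PHP, with the field layout taken from RFC 1952 (gzencode() is used only to produce an example stream):
$gz = gzencode('helloworld');
$h  = unpack('Cid1/Cid2/Ccm/Cflg/Vmtime/Cxfl/Cos', substr($gz, 0, 10));
print_r($h); // PHP typically leaves MTIME at 0 and sets OS to 3 (Unix) - yet another variant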
I am developing a PHP application where large amounts of text need to be stored in a MySQL database. I have come across PHP's gzcompress and MySQL's COMPRESS functions as possible ways of reducing the stored data size.
What is the difference, if any, between these two functions?
(My current thoughts are that gzcompress seems more flexible, in that it allows the compression level to be specified, whereas COMPRESS may be a bit simpler to implement and offer better decoupling. Performance is also a big consideration.)
The two methods are more or less the same thing; in fact, you can mix them: compress in PHP and uncompress in MySQL, and vice versa.
To compress in MySQL:
INSERT INTO table (data) VALUE(COMPRESS(data));
To compress in PHP:
$compressed_data = "\x1f\x8b\x08\x00".gzcompress($uncompressed_data);
To uncompress in MySQL:
SELECT UNCOMPRESS(data) FROM table;
To uncompress in PHP:
$uncompressed_data = gzuncompress(substr($compressed_data, 4));
Another option is to use MySQL table compression.
It only requires configuration, and then it is transparent.
This may be an old question, but it's important as a Google search destination. The results of MySQL's COMPRESS() and PHP's gzcompress() are the same EXCEPT that MySQL puts a 4-byte header on the data, which indicates the uncompressed data length. You can easily skip the first 4 bytes of MySQL's COMPRESS() output and feed the rest to gzuncompress() and it will work, but you cannot take the result of PHP's gzcompress() and use MySQL's UNCOMPRESS() on it, unless you take specific care to prepend that 4-byte length header - which of course requires knowing the uncompressed length already.
The accepted answer does not use the right 4-byte header.
The first 4 bytes are the LENGTH, not a static header.
I have no idea about the implications of using a wrong length, but it cannot be good and has the potential to corrupt the table contents in the future (if not now).
The correct answer, with a proof-of-concept example:
Output of MySQL:
mysql : "select hex(compress('1234512345'))"
0A000000789C3334323631350411000AEB01FF
The PHP equivalent (a sketch, assuming the 4-byte little-endian length prefix described above; on a stock zlib build the output should match the MySQL hex byte for byte):
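$s = '1234512345';
echo strtoupper(bin2hex(pack('V', strlen($s)) . gzcompress($s)));
// 0A000000789C3334323631350411000AEB01FF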
They both use zlib, so the compression will likely be about the same. Test it and see.
Adding this answer for reference, as I needed to use uncompress() to decompress data where the decompressed size was stored in a separate column from the blob.
As per the previous answers, uncompress() expects the first 4 bytes of the compressed data to be the length, stored in little-endian format. This can be prepended using concat, e.g.:
select uncompress(
    concat(
        char(size & 0x000000ff),
        char((size & 0x0000ff00) >> 8),
        char((size & 0x00ff0000) >> 16),
        char((size & 0xff000000) >> 24),
        compressed_data)) as decompressed
from my_blobs;
John's answer is almost correct. The length must be computed with strlen instead of mb_strlen, as the latter counts a multibyte character as "1 character" although it spans multiple bytes. Take the following example with a "▄" character, which consists of 3 bytes:
$string="▄";
$compressed = gzcompress($string, 6);
echo "with strlen\n";
$len = strlen($string);
$head = pack('V', $len);
$base64 = base64_encode($head.$compressed);
echo "Length of string: $len\n";
echo $base64."\n";
echo `mysql -e "SELECT UNCOMPRESS(FROM_BASE64('$base64'))" -u root -proot -h mysql`;
echo "\n\nwith mb_strlen\n";
$len = mb_strlen($string);
$head = pack('V', $len);
$base64 = base64_encode($head.$compressed);
echo "Length of string: $len\n";
echo $base64."\n";
echo `mysql -e "SELECT UNCOMPRESS(FROM_BASE64('$base64'))" -u root -proot -h mysql`;
Output:
with strlen
Length of string: 3
AwAAAHicezStBQAEWQH9
UNCOMPRESS(FROM_BASE64('AwAAAHicezStBQAEWQH9'))
▄
with mb_strlen
Length of string: 1
AQAAAHicezStBQAEWQH9
UNCOMPRESS(FROM_BASE64('AQAAAHicezStBQAEWQH9'))
NULL
I have just bought a GPS tracker; it can send SMS to a cellphone just fine. It also supports reporting to a server via GPRS.
I have set up the device to contact my own server on port 8123; it's a FreeBSD server, and I have checked that I receive packets on that port.
I have successfully set up a listener server written in PHP, and I can receive data from the device. But how do I convert the partial hex data to something useful (ASCII)?
Example data string:
$$^#T^#E Y'^WÿU210104.000,A,5534.4079,N,01146.2510,E,0.00,,170411,,*10|1.0|72|0000á
Unfortunately I don't know how I can copy-paste the hex parts.
Now how do I get the ID part out? I have tried echo hexdec(mb_substr($data, 4, 7));
The data follows this protocol.
From the document:
Command format of GPRS packets is as follows:
From server to tracker:
@@<L><ID><command><data><checksum>\r\n
From tracker to server:
$$<L><ID><command><data><checksum>\r\n
Note:
Do NOT input '<' or '>' when writing a command.
All multi-byte data complies with the following sequence: High byte prior to low byte.
The size of a GPRS packet (including data) is about 100 bytes
Item Specification
@@ 2 bytes. It is the header of a packet from server to tracker.
It is in ASCII code (hex code: 0x40).
$$ 2 bytes. It is the header of a packet from tracker to server.
It is in ASCII code (hex code: 0x24).
L 2 bytes. It is the length of the whole packet, including the
header and ending character, in hex code.
ID 7 bytes. The ID must be digits, not more than 14 of them; unused nibbles
are stuffed with 'f' (0xff). It is in the format of hex code.
For example, if the ID is 13612345678, it will be shown as
follows: 0x13, 0x61, 0x23, 0x45, 0x67, 0x8f, 0xff.
If all 7 bytes are 0xff, it is a broadcasting command. ID is in hex code.
command 2 bytes. The command code is in hex code. Please refer to the
command list below.
data Min 0 byte and max 100 bytes. See Annex 1 for description of ‘data’.
checksum 2 bytes. It indicates CRC-CCITT (default is 0xffff) checksum of
all data (not including CRC itself and the ending character).
It is in hex code.
For example: 24 24 00 11 13 61 23 45 67 8f ff 50 00 05 d8 0d 0a
0x05d8 = CRC-CCITT (24 24 00 11 13 61 23 45 67 8f ff 50 00)
\r\n 2 bytes. It is the ending character and in hex code
(0x0d,0x0a in hex code)
Update
With the answer from Anomie, i was able to piece this together
$arr = unpack('H4length/H14id/H4cmd/H4crc/H4end', mb_substr($data, 2, 11) . mb_substr($data, -4));
var_dump($arr);
This will out put something like
array(5) {
["length"]=>
string(4) "0054"
["id"]=>
string(14) "004512345678ff"
["cmd"]=>
string(4) "9955"
["crc"]=>
string(4) "c97e"
["end"]=>
string(4) "0d0a"
}
It sounds like you need to convert binary data to integers or strings. The most straightforward way is to use unpack.
For example, to extract the length, you already know you can use
$length_bin = substr($string, 2, 2);
To convert it to an integer, you can use something like
$length = unpack('n', $length_bin); $length = $length[1];
The 'n' code (big-endian, matching the protocol's "high byte prior to low byte" note) will work for the length and the checksum; if you have a number stored as 4 bytes use 'N', and for the ID you can use 'H*' to get it as a string of hex digits. Other codes are listed in the documentation.
A somewhat less straightforward way is to do the bit manipulation manually, after using unpack with C* to get an array of all the byte values. For example,
$bytes = unpack('C*', $length_bin);
$length = ($bytes[1] << 8) | $bytes[2]; // note: unpack('C*', ...) returns a 1-based array
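Putting it together with the documented example packet from the question (a sketch; multi-byte fields are read big-endian, per the protocol's high-byte-prior-to-low-byte note):
$packet = hex2bin('2424001113612345678fff500005d80d0a');
$f = unpack('a2header/nlength/H14id/ncommand', substr($packet, 0, 13));
$f['checksum'] = bin2hex(substr($packet, -4, 2));
print_r($f); // header: $$, length: 17, id: 13612345678fff, command: 20480 (0x5000), checksum: 05d8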
You need to know the format of the messages you will receive from the device; you can get this info from the manufacturer. Then, depending on that, you have to create a proper listener on the server side.
I've worked with several devices like this, and normally you create a process on the server listening on the port with a socket (or similar). You may also have an authentication step to differentiate between devices (you can have more than one). After that, you simply get the data from the device, parse it, and store it. Depending on the device, you can also send requests or configuration commands.
Hope this helps
*Edit 26 April:* I have changed the question a bit, so this seems out of place. The initial question was more about how to read the data from TCP.
I found some great articles on writing a TCP/socket server in PHP (/me slaps PHP around a bit with a large trout):
http://devzone.zend.com/article/1086
http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
Can't wait to get this going :)