MySQL COMPRESS vs PHP gzcompress - php

I am developing a PHP application where large amounts of text needs to be stored in a MySQL database. Have come across PHP's gzcompress and MySQL's COMPRESS functions as possible ways of reducing the stored data size.
What is the difference, if any, between these two functions?
(My current thoughts are gzcompress seems more flexible in that it allows the compression level to be specified, whereas COMPRESS may be a bit simpler to implement and better decoupling? Performance is also a big consideration.)

The two methods are more or less the same thing, in fact you can mix them: compress in php and uncompress in MySQL and vice versa.
To compress in MySQL:
INSERT INTO table (data) VALUE(COMPRESS(data));
To compress in PHP:
$compressed_data = "\x1f\x8b\x08\x00".gzcompress($uncompressed_data);
To uncompress in MySQL:
SELECT UNCOMPRESS(data) FROM table;
To uncompress in PHP:
$uncompressed_data = gzuncompress(substr($compressed_data, 4));
Another option is to use MySQL table compression.
It only require configuration and then it is transparent.

This may be an old question, but it's important as a Google search destination. The results of MySQL's COMPRESS() vs PHP's gzcompress() are the same EXCEPT for MySQL puts a 4-byte header on the data, which indicates the uncompressed data length. You can easily ignore the first 4 bytes from MySQL's COMPRESS() and feed it to gzuncompress() and it will work, but you cannot take the results of PHP's gzcompress() and use MySQL's UNCOMPRESS() on it, unless you take specific care to add in that 4-byte length header, which of course requires having the uncompressed data already...

The accepted answer does not use the right 4 byte header.
The first 4 bytes are the LENGTH and not a static header.
I have no idea about the implications of using a wrong length but it can not be good and has the potential to crash the database or table content in the future (if not now)
The correct answer with POC example:
Output of mysql:
mysql : "select hex(compress('1234512345'))"
0A000000789C3334323631350411000AEB01FF
The php equivalent:

They both use zlib, so the compression will likely be about the same. Test it and see.

Adding this answer for reference, as I needed to use uncompress() to decompress data where the decompressed size was stored in a separate column to the blob.
As per the previous answers, uncompress() expects the first 4 bytes of the compressed data to be the length, stored in little-endian format. This can be prepended using concat e.g.
select uncompress(
concat(
char(size & 0x000000ff),
char((size & 0x0000ff00) >> 8),
char((size & 0x00ff0000) >> 16),
char((size & 0xff000000) >> 24),
compressed_data)) as decompressed
from my_blobs;

Johns answer is almost correct. The length must be computed by using strlen instead of mb_strlen as the latter will recognize multibyte characters as "1 character" although they span multiple bytes. Take the following example with a "▄" character that consists of 3 bytes:
$string="▄";
$compressed = gzcompress($string, 6);
echo "with strlen\n";
$len = strlen($string);
$head = pack('V', $len);
$base64 = base64_encode($head.$compressed);
echo "Length of string: $len\n";
echo $base64."\n";
echo `mysql -e "SELECT UNCOMPRESS(FROM_BASE64('$base64'))" -u root -proot -h mysql`;
echo "\n\nwith mb_strlen\n";
$len = mb_strlen($string);
$head = pack('V', $len);
$base64 = base64_encode($head.$compressed);
echo "Length of string: $len\n";
echo $base64."\n";
echo `mysql -e "SELECT UNCOMPRESS(FROM_BASE64('$base64'))" -u root -proot -h mysql`;
Output:
with strlen
Length of string: 3
AwAAAHicezStBQAEWQH9
UNCOMPRESS(FROM_BASE64('AwAAAHicezStBQAEWQH9'))
▄
with mb_strlen
Length of string: 1
AQAAAHicezStBQAEWQH9
UNCOMPRESS(FROM_BASE64('AQAAAHicezStBQAEWQH9'))
NULL

Related

How to store and retrieve extended ASCII characters in MSSQL

I was surprised that I was unable to find a straightforward answer to this question by searching.
I have a web application in PHP that takes user input. Due to the nature of the application, users may often use extended ASCII characters (a.k.a. "ALT codes").
My specific issue at the moment is with ALT code 26, which is a right arrow (→). This will be accompanied with other text to be stored in the same field (for example, 'this→that').
My column type is NVARCHAR.
Here's what I've tried:
I've tried doing no conversions and just inserting the value as normal, but the value gets stored as thisâ??that.
I've tried converting the value to UCS-2 in PHP using iconv('UTF-8', 'UCS-2', $value), but I get an error saying Unclosed quotation mark after the character string 't'.. The query ends up looking like this: UPDATE myTable SET myColumn = 'this�!that'.
I've tried doing the above conversion and then adding an N before the quoted value, but I get the same error message. The query looks like this: UPDATE myTable SET myColumn = N'this�!that'.
I've tried removing the UCS-2 conversion and just adding the N before the quoted value, and the query works again, but the value is stored as thisâ that.
I've tried using utf8_decode($value) in PHP, but then the arrow is just replaced with a question mark.
So can anyone answer the (seemingly simple) question of, how can I store this value in my database and then retrieve it as it was originally typed?
I'm using PHP 5.5 and MSSQL 2012. If any question of driver/OS version comes into play, it's a Linux server connecting via FreeTDS. There is no possibility of changing this.
You might try base64 encoding the input, this is fairly trivial to handle with PHP's base64_encode() and base64_decode() and it should handle what ever your users throw at it.
(edit: You can apparently also do the base64 encoding on the SQL Server side. This doesn't seem like something it should be responsible for imho, but it's an option.)
It seems like your freetds.conf is wrong. You need a TDS protocol version >= 7.0 to support unicode. See this for more details.
Edit your freetds.conf:
[global]
# TDS protocol version
tds version = 7.4
client charset = UTF-8
Also make sure to configure PHP correct:
ini_set('mssql.charset', 'UTF-8');
The accepted answer seems to do the job; yes you can encode it to base64 and then decode it back again, but then all the applications that use that remote database, should change and support the fields to be base64 encoded. My thought is that if there is a remote MS SQL Server database, there could be an other application (or applications) that may use it, so that application have to also be changed to support both plain and base64 encoding. And you'll have to also handle both plain text and base64 converted text.
I searched a little bit and I found how to send UNICODE text to the MS SQL Server using MS SQL commands and PHP to convert the UNICODE bytes to HEX numbers.
If you go at the PHP documentation for the mssql_fetch_array (http://php.net/manual/ru/function.mssql-fetch-array.php#80076), you'll see at the comments a pretty good solution that converts the text to UNICODE HEX values and then sends that HEX data directly to MS SQL Server like this:
Convert Unicode Text to HEX Data
// sending data to database
$utf8 = 'Δοκιμή με unicode → Test with Unicode'; // some Greek text for example
$ucs2 = iconv('UTF-8', 'UCS-2LE', $utf8);
// converting UCS-2 string into "binary" hexadecimal form
$arr = unpack('H*hex', $ucs2);
$hex = "0x{$arr['hex']}";
// IMPORTANT!
// please note that value must be passed without apostrophes
// it should be "... values(0x0123456789ABCEF) ...", not "... values('0x0123456789ABCEF') ..."
mssql_query("INSERT INTO mytable (myfield) VALUES ({$hex})", $link);
Now all the text actually is stored to the NVARCHAR database field correctly as UNICODE, and that's all you have to do in order to send and store it as plain text and not encoded.
To retrieve that text, you need to ask MS SQL Server to send back UNICODE encoded text like this:
Retrieving Unicode Text from MS SQL Server
// retrieving data from database
// IMPORTANT!
// please note that "varbinary" expects number of bytes
// in this example it must be 200 (bytes), while size of field is 100 (UCS-2 chars)
// myfield is of 50 length, so I set VARBINARY to 100
$result = mssql_query("SELECT CONVERT(VARBINARY(100), myfield) AS myfield FROM mytable", $link);
while (($row = mssql_fetch_array($result, MSSQL_BOTH)))
{
// we get data in UCS-2
// I use UTF-8 in my project, so I encode it back
echo '1. '.iconv('UCS-2LE', 'UTF-8', $row['myfield'])).PHP_EOL;
// or you can even use mb_convert_encoding to convert from UCS-2LE to UTF-8
echo '2. '.mb_convert_encoding($row['myfield'], 'UTF-8', 'UCS-2LE').PHP_EOL;
}
The MS SQL Table with the UNICODE Data after the INSERT
The output result using a PHP page to display the values
I'm not sure if you can reach my test page here, but you can try to see the live results:
http://dbg.deve.wiznet.gr/php56/mssql/test1.php

Piecemeal bzcompression for large files in PHP

Creating bzip2 archived data in PHP is very easy thanks to its implementation in bzcompress. In my present application I cannot in all reason simply read the input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data will yield the same result as when compressing the whole file in one single swoop. I mean something along the lines of
$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);
I tried out a piecemeal bzcompression using the routines shown below
function makeBZFile($infile,$outfile)
{
$fp = fopen($infile,'r');
$bz = bzopen($outfile,'w');
while (!feof($fp))
{
$bytes = fread($fp,10240);
bzwrite($bz,$bytes);
}
bzclose($bz);
fclose($fp);
}
function unmakeBZFile($infile,$outfile)
{
$bz = bzopen($infile,'r');
while (!feof($bz))
{
$str = bzread($bz,10240);
file_put_contents($outfile,$str,FILE_APPEND);
}
}
set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd');
To test this code I did two things
I used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database - which is what I need to do eventually.
I created a 50Mb filled with random data dd if=/dev/urandom of='/tmp.test.rnd bs=50M count=1
In both cases I performed a diff original.file decompressed.file and found that the two were identical.
All very nice but it is not clear to me why this is working. The PHP docs state that bzread(bzpointer,length) reads a maximum length bytes of UNCOMPRESSED data. If my code below is woring it is because I am forcing the bzwite and bzread size to 10240 bytes.
What I cannot see is just how bzread knows how to fetch lenth bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see tht there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.
I suspect there is a gap in my understanding of how this works - or else the fact that my code below appears to perform a correct piecemeal compression is purely accidental.
I'd much appreciate a few explanations here.
To understand how the decompression get the length of bytes you have to understand first the compression. It seems that you don't know any thing about compression algorigthim.
BZIP2
Crucial algorithm of BZIP2 is the Burrows Wheeler transformation (BWT), that converts the original data into a suitable form for following coding. The current version applies a Huffman code. Compression algorithm processes the data in blocks totally independent from each block. Block sizes can be set in a range from 1-9 (100,000 - 900,000 bytes).
BZIP2 Data Structure
The first two character of compressed string start with letter 'BZ' and thereafter 1 byte for algorigthim used. Thereafter identification of the block size immediately follows, being valid for the entire file (h1, h2, h3 to h9). The parameter indicates the block size in units from 1-9 (100,000 - 900,000 bytes).
Actual original data are stored in blocks according to the selected size and will be protected individually with a CRC32 checksum. Additionally a 48 bit identifier introduces each block. This block structure allows a partial reconstruction of damaged files.
GZIP/BZIP
Gzip and bzip2 are functionally equivalent. One advantage of GZIP is that it can compress a stream, a sequence where you can't look behind. This makes it the official compressor of http streams. GZZIP DEFLATE RFC 1951 Compressed Data Format Specification and GUNZIP RFC 1952 File Format Specification are published documents.
GIP explained

Get compressed byte size after zlib_decode()?

I'm trying to use PHP to parse a custom gzip archive file format that was created in Delphi (not my code!). The format is basically:
4-byte integer: count of files in archive
for each compressed file:
4-byte integer: filename length [n]
[n] bytes: filename
4-byte integer: uncompressed file length [m]
[????] bytes: gzipped content
I can read the file and actually decode the first compressed file correctly by using zlib_decode() with a max uncompressed length of [m] bytes on the remainder of the file after I know the length ([m]), but then I'm stuck because I don't know how far into the substring I should go to find the next filename -- zlib_decode() doesn't return the number of compressed bytes that it processed before stopping. Since this is a custom format, it doesn't seem like I can use the normal gzopen()/gzread() functions because the entire file isn't compressed (I tried, it doesn't work).
This code works in Delphi because apparently you can pass a file handle back and forth between normal file reading functions and the System.ZLib decoding functions -- you can read [m] uncompressed bytes and the pointer will remain at the last compressed byte -- but PHP doesn't seem to support switching between read-as-normal and read-as-gzip on the fly that way.
Am I missing an obvious way in PHP to deal with a mixed-content file format like this, where metadata and compressed data are stacked together this way? Or am I out of luck without knowing the compressed data length?
A dirty workaround is to recompress the content of each file as I am able to parse it, use that to calculate the compressed length, and adjust the file pointer in the original file manually as follows:
$current_pos = ftell($handle);
$skip_length = strlen(gzencode($uncompressed_text,9,FORCE_DEFLATE));
fseek($handle, $skip_length+$current_pos);
This works, but feels very hack-ish. I'd still be open to any better approaches.
EDIT:
Just a note that this eventually failed. However, I was fortunate enough to know in advance the list of expected filenames and I was able to do the following (more reliable since zlib_decode() will decode as much as it can and discard the rest anyway):
foreach ($filenames as $thisFilename) {
$thisPos = strpos($rawData, $thisFilename);
$gzresult = zlib_decode(substr($rawData, $thisPos + strlen($table) + 8)); // skip 8 bytes for filename size and uncompressed data size, which are useless info.
}

Get Zipfile as Bytes in php?

Using php, how can I read a zip file and get its bytes, for example something like
$contents = file_get_contents('myzipfile.zip');
echo $contents;
// outputs: 504b 0304 1400 0000 0800 1bae 2f46 20e0
Thank you!
file_get_contents gets the raw bytes, your echo outputs those raw bytes. If you expect to output a hexadecimal notation of the raw byte contents instead, use bin2hex:
echo bin2hex($contents);
If you want that arbitrarily grouped with a space every two bytes, you can do something along these lines:
echo join(' ', str_split(bin2hex($contents), 4));
(Note that this is all rather inefficient, modifying the entire, possibly many megabyte large file in memory. I'm expecting this is just for debugging purposes, so won't go out of my way to write super efficient code.)
file_get_contents() will return the exact contents of the file, so the format depends on the file type.
If you are looking for the byte size of the file you can get any available file information with the core SPL library's fileInfo class:
$info = new SplFileInfo('myzipfile.zip');
$bytes = $info->getSize();

string contain many '\0' after inflate

I try to decompress blocks of data which were compressed with zlib and author made remarks that for decompress i must use inflate_init and inflate with Z_SYNC_FLUSH. I sure that this must work because that works on php in this way :
$temp = substr($temp, 2, -4);
$temp{0} = chr(ord($temp{0}) | 1);
$temp = gzinflate($temp);
but i ckecked many method for decompress this on C++ and every time fail.
Here is one of them :
char compressedblockbuffer[3371];
char uncompressedblockbuffer[8192];
is.read(compressedblockbuffer, 3371);
z_stream strm;
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
strm.avail_in = 3371;
strm.next_in = (Bytef *)compressedblockbuffer;
strm.avail_out = 8192;
strm.next_out = (Bytef *)uncompressedblockbuffer;
inflateInit(&strm);
inflate(&strm, Z_SYNC_FLUSH);
inflateEnd(&strm);
It's not full code, just example to show problem and thats why i specified already known sizes.
I use last zlib realize so may be something change in the zlib inflate since 2003-2004 years?
So the result is :
So seems that uncompressedblockbuffer contains '\0' at the 2,3,4 indexes and many other and if i print this to console i just see two first elements.
UPD:
If gzinflate() in PHP works on the data, then your code won't. gzinflate() expects raw deflate data. Your code is looking for zlib-wrapped deflate data. If you want to decode raw deflate data, you need to use inflateInit2(&strm, -15) instead.
Your call to inflate() is likely returning an error that you are not checking for. You need to always check the return codes of the zlib routines, or for that matter any function that has the potential to return an error.
What kind of data are you decompressing? Many binary formats are perfectly accepting of NUL bytes in their data, since it just reads as a value of 0. For example, inside of image data in many formats, it'd just represent a value of 0 in either that channel or pixel (depending on data size). Not to mention, binary formats don't necessarily read as bytes. A NUL byte may actually be a part of a 2- or 4-byte value.
This is the problem with trying to read binary data as a character string. Binary data needn't follow the rules of text. This is why usually the data boundary is a separate size value, because it can't terminate on NUL values like text.
If you have the original uncompressed data for comparison, either load that data into memory and compare the data, or save the decompressed data to a file and use a diff tool to do a binary comparison of the files.

Categories