PHP gzcompress vs gzopen/gzwrite

PHP gzcompress vs gzopen/gzwrite - php

I'm writing a PHP script that generates gzipped files. The approach I've been using is to build up a string in PHP and gzcompress() the string before writing it out to a file at the end of the script.
Now I'm testing my script with larger files and running into memory allocation errors. It seems that the result string is becoming too large to hold in memory at one time.
To solve this I've tried to use gzopen() and gzwrite() to avoid allocating a large string in PHP. However, the gzipped file generated with gzwrite() is very different from when I use gzcompress(). I've experimented with different zip levels but it doesn't help. I've also tried using gzdeflate() and end up with the same results as gzwrite(), but still not similar to gzcompress(). It's not just the first two bytes (zlib header) that are different, it's the entire file.
What does gzcompress() do differently from these other gzip functions in PHP? Is there a way I can emulate the results of gzcompress() while incrementally producing the result?

The primary difference is that the gzwrite function initiates zlib with the SYNC_FLUSH option, which will pad the output to a 4 byte boundary (or is it 2), and then a little extra (0x00 0x00 0xff 0xff 0x03).
If you are using these to create Zip files, beware that the default Mac Archive utility does NOT accept this format.
From what I can tell, SYNC_FLUSH is a gzip option, and is not allowed in the PKZip/Info-ZIP format, all .zip files and their derivatives come from.
If you deflate a small file/text, resulting in a single deflate block, and compare it to the same text written with gzwrite, you'll see 2 differences, one of the bytes in the header of the deflate block is different by 1, and the end is padded with the above bytes. If the result is larger than one deflate block, the differences start piling up. It is hard to fix this, as the deflate stream block headers aren't even byte aligned. There is a reason everybody uses the zlib. Few people are brave enough to even attempt to rewrite that format!

I am not 100% certain, but my guess is that gzcompress uses GZIP format, and gzopen/gzwrite use ZLIB. Honestly, I can't tell you what the difference between the two is, but I do know that GZIP uses ZLIB for the actual compression.
It is possible that none of that will matter though. Try creating a gzip file with gzopen/gzwrite and then decompress it using the command-line gzip program. If it works, then using gzopen/gzwrite will work for you.

I ran into a similar problem once - basically there wasn't enough ram allocated to php to do the business.
I ended up saving the string as a text file, then using exec() to gzip the file using the filesystem. Its not an ideal solution but it worked for my situation.

try increasing the memory_limit parameter in your php.ini file

Both gzcompress() and gzopen() use the DEFLATE method for compressing blocks.
But they have different header/trailer.

Related

PHP or Apache limiting Content-Length in HTML Header?

I have to distribute a huge file to some people (pictures of a prom) via my Apache2/PHP server which is giving me some headaches: Chrome and Firefox both show a filesize of 2GB but the file is actually >4GB, so I started to track things down.
I am doing the following thing in my php script:
header("Content-Length: ".filesize_large($fn));
header("Actual-File-Size: ".filesize_large($fn)); //Debug
readfile($fn);
filesize_large() is returning the correct filesize for >4gb files as a string (yes, even on 32-bit PHP).
Now the interesting part; the actual HTTP header:
Content-Length: 2147483647
Actual-File-Size: 4236525700
So the filesize_large() method is working totally fine, but PHP or Apache somehow limit the value of Content-Length?! Why that?
Apache/2.2.22 x86, PHP 5.3.10 x86, I am using SSL over https
Just so you guys believe me when I say filesize_large() is correct:
function filesize_large($filename)
{
return trim(shell_exec('stat -c %s '.escapeshellarg($filename)));
}
Edit:
Seems like PHP casts the content length to an integer when communicating with apache2 over the sapi interface on 32-bit systems. No workaround sadly except not including the Content-Size in case of files >2GB
Workaround (and actually a far better solution in the first place): Use mod_xsendfile

You have to use 64-bit operation system in order to support long integers for Content-length header.
I would recommend to use Vagrant for development.
Header based on strings, but content length based on int. If take a look here https://books.google.com/books?id=HTo_AmTpQPMC&pg=PA130&lpg=PA130&dq=ap_set_content_length%28r,+r-%3Efinfo.size%29;&source=bl&ots=uNqmcTbKYy&sig=-Wth33sukeEiSnUUwVJPtyHSpXU&hl=en&sa=X&ei=GP0SVdSlFM_jsATWvoGwBQ&ved=0CDEQ6AEwAw#v=onepage&q=ap_set_content_length%28r%2C%20r-%3Efinfo.size%29%3B&f=false
you will see example of ap_set_content_length(); function which was used to serve content-length response. It accepts file length from system function. Try to call php filesize() and you'll probably see the same result.
If you take a look into ap_set_content_length declaration http://ci.apache.org/projects/httpd/trunk/doxygen/group__APACHE__CORE__PROTO.html#ga7ab393c56cf073ce7aadc3b7ca3db7b2
you will see that length declared as apr_off_t.
And here http://svn.haxx.se/dev/archive-2004-01/0871.shtml you can read, that this type depends from compiler options which is 32bit in your case.
I would recommend you to read source code of Apache and PHP projects.

php gzip match unix gzip

I am trying to use php to create a file and zip it. How can i match the same compression levels/headers/and so on as gzip that is run in unix?
using php
ls -l
total 8
-rw-rw-r-- 1 owner owner 486 Jul 21 17:05 file.xml.gz
using gzip on unix command line:
ls -l
total 8
-rw-rw-r-- 1 owner owner 479 Jul 21 17:05 file.xml.gz
in php
$zip = gzencode($xml,2);
i have tried 0 through 9 as the compression level here, i have also tried
$zip = gzencode($xml,x,FORCE_DEFLATE)
again where x is 0-9
my problem is this:
I have a 3rd party vendor that takes the gzipped file, unzips it and does fun things with it. The problem i am running into is when i use php i get an error "cannot parse file.xm.gz", when i use gzip on cli it works fine. I have no vision into what the 3rd party is doing or why its failing. Could it be something like carriage returns or spaces or something in the xml? I know its a tough question to answer. heres a snippet of my php xml.
$xml ='<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<localRoutes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
';
$xml.='<route>
<user type="string">' . $mac . '</user>
';
$xml.='<next type="regex">!^(.*$)!sip:#' . $ip . "</next>
</route>
";
$xml .= '</localRoutes>';

The compressed data is identical. What's missing is a field in the header indicating the original name of the file (e.g, probably file.xml here). This field is generated by the gzip utility, but the gzencode() PHP function doesn't have an original filename to work with, so it doesn't write this field.
I'm not aware of any way to make PHP generate this field with the zlib extension. Its absence is very unlikely to cause any problems, though.

You can't.
You don't need to.
First, gzip and zlib (which is what php is using) have different compression algorithms, and so for large enough data, they will never produce the same compressed data, even at the same compression level.
Second, as noted by #duskwuff, you will not be able to replicate the same gzip header, unless you pull off the gzip header that php made and write your own. The modification dates in the headers will be different. The way you're doing it, one will have a file name and one will not. Though you can invoke gzip with -n to not store the file name.
Third, there is no reason to try to make the results identical. All that matters is that both decompress to the same thing. Which they will.

gzcompress won't produce a valid zipped file?

Consider this:
$text = "hello";
$text_compressed = gzcompress($text, 6);
$success = file_put_contents('file.gz', $text_compressed);
When i try to open file.gz, i get errors. How can i open file.gz under terminal without calling php? (using gzuncompress works just fine!)
I can't recode every file i did, since that i now have almost a Billion files encoded this way! So if there is a solution... :)

You need to use gzencode() instead.
Luckily for you, the fix is easy: just write a script that opens each of your files one by one, uses gzuncompress() to uncompress that file, and then writes that file back out with gzencode() instead of gzcompress(), repeating the process for all of the files.
Alternatively (since you said you "didn't want to recode your files"), you could use uncompress to open the existing files from the command line (instead of gunzip/zcat).
As noted on the gzcompress() manual page:
gzcompress
This is not the same as gzip compression, which includes some header data. See gzencode() for gzip compression.

As said, you don't really have gzipped files. To open your files from a terminal you need the uncompress utility.

LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing

Often a web service needs to zip up several large files for download by the client. The most obvious way to do this is to create a temporary zip file, then either echo it to the user or save it to disk and redirect (deleting it some time in the future).
However, doing things that way has drawbacks:
a initial phase of intensive CPU and disk thrashing, resulting in...
a considerable initial delay to the user while the archive is prepared
very high memory footprint per request
use of substantial temporary disk space
if the user cancels the download half way through, all resources used in the initial phase (CPU, memory, disk) will have been wasted
Solutions like ZipStream-PHP improve on this by shovelling the data into Apache file by file. However, the result is still high memory usage (files are loaded entirely into memory), and large, thrashy spikes in disk and CPU usage.
In contrast, consider the following bash snippet:
ls -1 | zip -# - | cat > file.zip
# Note -# is not supported on MacOS
Here, zip operates in streaming mode, resulting in a low memory footprint. A pipe has an integral buffer – when the buffer is full, the OS suspends the writing program (program on the left of the pipe). This here ensures that zip works only as fast as its output can be written by cat.
The optimal way, then, would be to do the same: replace cat with a web server process, streaming the zip file to the user with it created on the fly. This would create little overhead compared to just streaming the files, and would have an unproblematic, non-spiky resource profile.
How can you achieve this on a LAMP stack?

You can use popen() (docs) or proc_open() (docs) to execute a unix command (eg. zip or gzip), and get back stdout as a php stream. flush() (docs) will do its very best to push the contents of php's output buffer to the browser.
Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).
(Note: don't use flush(). See the update below for details.)
Something like the following can do the trick:
<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');
// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');
// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while( !feof($fp) ) {
$buff = fread($fp, $bufsize);
echo $buff;
}
pclose($fp);
You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.
If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server; and use child_process module to spawn the tar/zip/whatever pipeline.
Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per-processor-core.
Update (from Benji's excellent feedback in the comments section on this answer)
1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.
[editorial note] 8192 is almost certainly a platform dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the os to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems.
There are other circumstances that can cause fread to return even less than 8192 bytes -- for example, the "remote" client (or process) is slow to fill the buffer - in most cases, fread() will return the contents of the input buffer as-is without waiting for it to get full. This could mean anywhere from 0..os_buffer_size bytes get returned.
The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).
2. According to comments on fread docs, a few caveats: magic quotes may interfere and must be turned off.
3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.
4. If you're creating a zip (as opposed to gzip), you'd want to use the content type header:
Content-type: application/zip
or... 'application/octet-stream' can be used instead. (it's a generic content type used for binary downloads of all different kinds):
Content-type: application/octet-stream
and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the content-disposition header. (where filename indicates the name that should be suggested in the save dialog):
Content-disposition: attachment; filename="file.zip"
One should also send the Content-length header, but this is hard with this technique as you don’t know the zip’s exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?
Finally, here's a revised example that uses all of #Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):
<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');
// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');
// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while( !feof($fp) ) {
$buff = fread($fp, $bufsize);
echo $buff;
}
pclose($fp);
Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur when running in other configurations too. The problem appears to result when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes in an overrun of Apache's internal output buffer. This causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.
The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.

Another solution is my mod_zip module for Nginx, written specifically for this purpose:
https://github.com/evanmiller/mod_zip
It is extremely lightweight and does not invoke a separate "zip" process or communicate via pipes. You simply point to a script that lists the locations of files to be included, and mod_zip does the rest.

Trying to implement a dynamic generated download with lots of files with different sizes i came across this solution but i run into various memory errors like "Allowed memory size of 134217728 bytes exhausted at ...".
After adding ob_flush(); right before the flush(); the memory errors disappear.
Together with sending the headers, my final solution looks like this (Just storing the files inside the zip without directory structure):
<?php
// Sending headers
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="download.zip"');
header('Content-Transfer-Encoding: binary');
ob_clean();
flush();
// On the fly zip creation
$fp = popen('zip -0 -j -q -r - file1 file2 file3', 'r');
while (!feof($fp)) {
echo fread($fp, 8192);
ob_flush();
flush();
}
pclose($fp);

I wrote this s3 steaming file zipper microservice last weekend - might be useful: http://engineroom.teamwork.com/how-to-securely-provide-a-zip-download-of-a-s3-file-bundle/

According to the PHP manual, the ZIP extension provides a zip: wrapper.
I have never used it and I don't know its internals, but logically it should be able to do what you're looking for, assuming that ZIP archives can be streamed, which I'm not entirely sure of.
As for your question about the "LAMP stack" it shouldn't be a problem as long as PHP is not configured to buffer output.
Edit: I'm trying to put a proof-of-concept together, but it seems not-trivial. If you're not experienced with PHP's streams, it might prove too complicated, if it's even possible.
Edit(2): rereading your question after taking a look at ZipStream, I found what's going to be your main problem here when you say (emphasis added)
the operative Zipping should operate in streaming mode, ie processing files and providing data at the rate of the download.
That part will be extremely hard to implement because I don't think PHP provides a way to determine how full Apache's buffer is. So, the answer to your question is no, you probably won't be able to do that in PHP.

It seems, you can eliminate any output-buffer related problems by using fpassthru(). I also use -0 to save CPU time since my data is compact already. I use this code to serve a whole folder, zipped on-the-fly:
chdir($folder);
$fp = popen('zip -0 -r - .', 'r');
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="'.basename($folder).'.zip"');
fpassthru($fp);

I just released a ZipStreamWriter class written in pure PHP userland here:
https://github.com/cubiclesoft/php-zipstreamwriter
Instead of using external applications (e.g. zip) or extensions like ZipArchive, it supports streaming data into and out of the class by implementing a full-blown ZIP writer.
How the streaming aspect works is by using the ZIP file format's "Data Descriptors" as described by section 4.3.5 of the PKWARE ZIP file specification:
4.3.5 File data MAY be followed by a "data descriptor" for the file.
Data descriptors are used to facilitate ZIP file streaming.
There are some possible limitations to be aware of though. Not every tool can read streaming ZIP files. Also, support for Zip64 streaming ZIP files may have even less support but that's only of concern for files over 2GB with this class. However, both 7-Zip and the Windows 10 built-in ZIP file reader seem to be fine with handling all of crazy files that the ZipStreamWriter class threw at them. The hex editor I use got a good workout too.
When using the ZipStreamWriter class, I recommend allowing a buffer to build up to at least 4KB but no more than 65KB at a time before sending it on to the web server. Otherwise, for lots of really tiny files, you'll be flushing out tiny bits of piecemeal data and waste a bunch of extra CPU cycles on the Apache callback end of things.
When something doesn't exist or I don't like the existing options, I find both official and unofficial specifications, some examples to work with, and then I build it from scratch. It's a fairly solid approach to problem solving, if just a tad overkill.

Compressing content with PHP ob_start() vs Apache Deflate/Gzip?

Most sites want to compress their content to save on bandwidth. However, When it comes to apache servers running PHP there are two ways to do it - with PHP or with apache. So which one is faster or easier on your server?
For example, in PHP I run the following function at the start of my pages to enable it:
/**
* Gzip compress page output
* Original function came from wordpress.org
*/
function gzip_compression() {
//If no encoding was given - then it must not be able to accept gzip pages
if( empty($_SERVER['HTTP_ACCEPT_ENCODING']) ) { return false; }
//If zlib is not ALREADY compressing the page - and ob_gzhandler is set
if (( ini_get('zlib.output_compression') == 'On'
OR ini_get('zlib.output_compression_level') > 0 )
OR ini_get('output_handler') == 'ob_gzhandler' ) {
return false;
}
//Else if zlib is loaded start the compression.
if ( extension_loaded( 'zlib' ) AND (strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== FALSE) ) {
ob_start('ob_gzhandler');
}
}
The other option is to use Apache deflate or gzip (both which are very close). To enable them you can add something like this to your .htaccess file.
AddOutputFilterByType DEFLATE text/html text/plain text/xml application/x-httpd-php
Since PHP is a scripting language (which must be loaded by PHP) I would assume that the apache method would be 1) more stable and 2) faster. But assumptions don't have much use in the real world.
After all, you would assume that with the huge financial backing windows has... uh, we won't go there.

We're running... a lot of webservers, handling 60M/uniques/day. Normally this isn't worth mentioning but your question seems based on experience.
We run with doing it in apache. What comes out the other end is the same (or near enough so as to not to matter) regardless of the method you choose.
We choose apache for few reasons:
Zero maintenance, we just turned it on. No one needs to maintain some case structure
Performance, in our tests servers where Apache did the work faired marginally better.
Apache will apply the output filter to everything, as opposed to just PHP. On some occasions there are other types of content being served on the same server, we'd like to compress our .css and .js
One word of warning, some browsers or other applications purposefully mangle the client headers indicating that compression is supported. Some do this to ease their job in terms of client side security (think applications like norton internet security and such). You can either ignore this, or try to add in extra cases to re-write requests to look normal (the browsers do support it, the application or proxy just futzed it to make its own life easier).
Alternatively, if you're using the flush() command to send output to the browser earlier, and you're applying compression you may need to pad the end of your string with whitespace to convince the server to send data early.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.