Get size of uncompressed data in zlib? - php

I'm creating something that includes a file upload service of sorts, and I need to store data compressed with zlib's compress() function. I send it across the internet already compressed, but I need to know the uncompressed file size on the remote server. Is there any way I can figure out this information without uncompress()ing the data on the server first, just for efficiency? That's how I'm doing it now, but if there's a shortcut I'd love to take it.
By the way, why is it called uncompress? That sounds pretty terrible to me, I always thought it would be decompress...

I doubt it. From memory, I don't believe this is something the underlying zlib libraries provide (although it's been a good 7 or 8 years since I used them, and the up-to-date docs don't seem to indicate this feature has been added).
One possibility would be to transfer another file which contained the uncompressed size (e.g., transfer both file.zip and file.zip.size) but that seems fraught with danger, especially if you get the size wrong.
Another alternative is, if the server uncompressing is time-expensive but doesn't have to be done immediately, to do it in a lower-priority background task (like with nice under Linux). But again, there may be drawbacks if the size checker starts running behind (too many uploads coming in).
And I tend to think of decompression in terms of "explosive decompression", not a good term to use :-)

If you're uploading using the raw 'compress' format, then you won't have information on the size of the data that's being uploaded. Pax is correct in this regard.
You can store it as a 4 byte header at the start of the compression buffer - assuming that the file size doesn't exceed 4GB.
Some C code as an example:
// Reserve room for a size header (uLongf) followed by the worst-case compressed size.
uLongf compressedSize = compressBound(bufsize);
uint8_t *compressBuffer = calloc(compressedSize + sizeof (uLongf), 1);
*((uLongf *)compressBuffer) = filesize;           // store the original size in the header
compress(compressBuffer + sizeof (uLongf), &compressedSize, sourceBuffer, bufsize);
Then you send the complete compressBuffer of the size compressedSize + sizeof (uLongf). When you receive it on the server side you can use the following code to get the data back:
// data is in compressBuffer, assume you already know compressed size.
uLongf originalSize = *((uLongf *)compressBuffer);
uint8_t *realCompressBuffer = compressBuffer + sizeof (uLongf);
If you don't trust the client to send the correct size then you will need to perform some sort of uncompressed data check on the server side. The suggestion of using uncompress to /dev/null is a reasonable one.
If you're uploading a .zip file, it contains a directory which tells you the size of each file when it's uncompressed. This information is built into the file format, though again it is subject to malicious clients.

The zlib format doesn't have a field for the original input size, so I doubt you will be able to do that without simulating a decompression of the data. The gzip format has an "input size" (ISIZE) field that you could use, but maybe you want to avoid changing the compression format or having the clients send the file size.
But even if you use a different format, if you don't trust the clients you would still need to run a more expensive check to make sure the uncompressed data is the size the client says it is. In this case, what you can do is to make the uncompress-to-/dev/null process less expensive, making sure zlib doesn't write the output data anywhere, as you just want to know the uncompressed size.
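For what it's worth, here is a minimal PHP sketch of that cheaper check (assuming PHP 7+ with the zlib extension; the function name is mine): stream the zlib-compressed upload through inflate_add() and count the output bytes without keeping them, so nothing is written to disk or held in memory.
<?php
// Count the uncompressed size of zlib data (PHP's gzcompress() format)
// without storing the decompressed output anywhere.
function uncompressed_size($path) {
    $ctx  = inflate_init(ZLIB_ENCODING_DEFLATE);   // zlib wrapper, as produced by gzcompress()
    $size = 0;
    $in   = fopen($path, 'rb');
    while (!feof($in)) {
        $chunk = fread($in, 65536);
        if ($chunk === '' || $chunk === false) {
            break;
        }
        $size += strlen(inflate_add($ctx, $chunk));
    }
    $size += strlen(inflate_add($ctx, '', ZLIB_FINISH));
    fclose($in);
    return $size;
}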

Related

PHP Imagick reinterpretation of PNG IDAT chunks

I noticed that PHP Imagick changes the IDAT chunks when processing PNGs.
How exactly is this done? Is there a possibility to create IDAT chunks that remain unchanged? Is it possible to predict the outcome of Imagick?
Background information on this question:
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs:
$image = new Imagick('uploaded_file.png');
$image->stripImage();                    // remove profiles, comments and other metadata
$image->writeImage('secure_file.png');   // re-encode and save the cleaned PNG
Comments are stripped out, so the only way to bypass this filter is to hide the PHP payload in the IDAT chunk(s). As described here, that is theoretically possible, but Imagick somehow reinterprets the image data even if I set Compression and CompressionQuality to the values I used to create the PNG. I also managed to create a PNG whose ZLIB header remained unchanged by Imagick, but the raw compressed image data didn't. The only PNGs where I got identical input and output are the ones which had already gone through Imagick before. I also tried to find the reason for this in the source code, but couldn't locate it.
I'm aware of the fact that other checks are necessary to ensure the uploaded file is actually a PNG etc. and PHP code in PNGs is no problem if the server is configured properly, but for now I'm just interested in this issue.
IDAT chunks can vary and still produce an identical image. The PNG spec unfortunately forces the IDAT chunks to form a single continuous data stream. What this means is that the data can be grouped/chunked differently, but when re-assembled into a single stream it will be identical. Is the actual data different, or has just the "chunking" changed? If the latter, why does it matter if the image is identical? PNG is a lossless type of compression; stripping the metadata and even decompressing+recompressing an image shouldn't change any pixel values.
If you're comparing the compressed data and expecting it to be identical, it can be different and still yield an identical image. This is because DEFLATE compression uses an iterative process to find the best matches in previous data. The higher the "quality" number you give it, the harder it will search for matches and the more it will shrink the output data size. With zlib, a level 9 deflate request will take a lot longer than the default and result in slightly smaller output data.
So, please answer the following questions:
1) Are you trying to compare the compressed data before/after your strip operation to see if somehow the image changed? If so, then looking at the compressed data is not the way to do it.
2) If you want to strip metadata without any other aspect of the image file changing, then you'll need to write the tool yourself. It's actually trivial to walk through PNG chunks and reassemble a new file while skipping the chunks you want to remove (see the sketch below).
Answer my questions and I'll update my answer with more details...
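For illustration, a rough PHP sketch of that chunk walk (the function name and the list of chunks to drop are mine, not from this answer): copy the 8-byte PNG signature, then each length/type/data/CRC record, skipping the ancillary metadata chunks, so the IDAT data passes through byte-for-byte.
<?php
function strip_png_metadata($src, $dst) {
    $in  = fopen($src, 'rb');
    $out = fopen($dst, 'wb');
    fwrite($out, fread($in, 8));                      // 8-byte PNG signature
    $drop = array('tEXt', 'zTXt', 'iTXt', 'tIME');    // ancillary chunks to skip
    while (!feof($in)) {
        $head = fread($in, 8);                        // 4-byte length + 4-byte type
        if (strlen($head) < 8) {
            break;
        }
        $len  = unpack('N', substr($head, 0, 4));
        $len  = $len[1];
        $type = substr($head, 4, 4);
        $body = ($len > 0) ? fread($in, $len) : '';
        $crc  = fread($in, 4);
        if (!in_array($type, $drop, true)) {
            fwrite($out, $head . $body . $crc);
        }
    }
    fclose($in);
    fclose($out);
}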
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs
You should never need to think about this. If you are worried about people hiding webshells in a file that is uploaded to your server, you are doing something wrong.
For example, serving those files through the PHP parser, which is the way a webshell could be invoked to attack a server.
From the Imagick readme file:
5) NEVER directly serve any files that have been uploaded by users directly through PHP, instead either serve them through the webserver, without invoking PHP, or use readfile to serve them within PHP.
readfile doesn't execute the file, it just sends it to the end-user without invoking it, and so completely prevents the type of attack you seem to be concerned about.
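As a minimal sketch of the readfile() approach (the path and content type below are illustrative, not from the readme): the file's bytes are streamed to the client as-is and never run through the PHP parser.
<?php
$file = '/var/uploads/secure_file.png';            // stored outside the web root
header('Content-Type: image/png');
header('Content-Length: ' . filesize($file));
readfile($file);                                   // sends the raw bytes, never executes them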

File Upload on unstable internet connection solution

So I have a file transfer website, which I developed using PHP and HTML5. This site is supposed to handle huge file transfers (100MB - 4GB file size). The HTML5 part is mainly there to split the file into smaller chunks. It works well on a stable connection.
However, there are several occasions where the file upload just gets cut off because the internet connection suddenly drops off and on for a split second.
My main question is: is there any way in PHP or HTML5/JavaScript to handle this type of problem? E.g. maybe if the connection drops off for only a couple of seconds, the script will keep going, and when the connection is back on again, it will continue the file upload?
Or if such a thing is impossible, is there any way to detect that the internet connection dropped and show alert("internet got drop"); or something like that?
Thank you
I have had a recent problem involving large binary signal files being corrupted by drop-outs during FTP transfers. Subsequently, alongside FTP solutions, we investigated options for web-based uploading similar to the sort of handling you seem to be looking for. The most robust solution remains the use of an rsync implementation and a continuous checksum.
That said, the option that seems to be missing directly from the HTML5 API is to be able to send a hash or checksum of the chunk in the chunk itself so that the verification that the data blob has been received intact is done transparently.
You should essentially be able to do this by doing some processing on the slice data and combining it with the blob before sending.
while (start < SIZE) {
    var chunk = blob.slice(start, end);
    // Compute a hash of the chunk with your library of choice (e.g. CRC32 or Adler-32)
    // var chunkHash = computeHash(chunk);
    // and prepend it to the data before sending:
    // chunk = new Blob([chunkHash, chunk]);
    uploadFile(chunk);
    start = end;
    end = start + BYTES_PER_CHUNK;
}
At the other end you pull off the first n bytes, compute the hash over the rest, and compare it with the hash in those first n bytes.
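A rough PHP sketch of that server-side check, under the assumption that the client prepends an 8-character CRC32 hex digest to each chunk (the file names are placeholders):
<?php
$raw      = file_get_contents('php://input');      // one uploaded chunk
$sentHash = substr($raw, 0, 8);                    // hash prefix added by the client
$data     = substr($raw, 8);                       // the actual chunk payload
if (hash('crc32b', $data) !== $sentHash) {
    http_response_code(400);                       // ask the client to resend this chunk
    exit('checksum mismatch');
}
file_put_contents('upload.part', $data, FILE_APPEND | LOCK_EX);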
Some articles for inspiration:
uploading a file in chunks using html5 is a good place I think to look at handling the file uploads
how to merge chunks of file (result of html5 chunking) into one file fast and efficient
Finally, probably worth investigating pre-existing libraries and doing some 'drop-out' testing to identify the best performing. http://www.plupload.com/ seems to have some following (not affiliated and untested for this purpose).

Recording Audio on website: Red5 stream or posting the audio data?

Let me first establish what I want to do:
My user is able to record voicenotes on my website, add tags to said notes for indexing as well as a title. When the note is saved I save the path of the note along with the other info in my DB.
Now, I have 2 choices to do the recording, both involve a .swf embedded in my site:
1) I could use a Red5 server to stream the audio to my server, save the file, and return the path to said file to my app to do the DB saving. This seems rather complicated, since I would have to convert the audio and move it to the appropriate folder that belongs to the user in a server-side Red5 app, which I don't really know how to build.
2) I could simply record the audio and grab its byte array, Base64-encode it and send it to PHP along with the rest of the necessary data (be it by a simple POST or an AJAX call), then decode it on the server and write the file with the appropriate extension. Audio conversion would also happen here using ffmpeg. This option seems simpler, but I don't know how viable it is.
Which option would you say is more viable and easier to develop? Thanks in advance
Depending on the planned duration of the recording, you may very well be able to use option number two. I recently used a similar approach successfully for a project, but recordings were only up to 30 seconds or so. Here's what I did differently from what you're suggesting though, and why I think it's better:
To capture the sound from the microphone and store it to a ByteArray, use the SAMPLE_DATA event which is dispatched whenever more sound data comes in from the microphone. There's an example in the documentation that should explain this well enough.
Because most users would be on normal home computers without any special recording equipment, it was safe to assume that the full fidelity of the recording is not necessary. I used just 2 bytes per sample, and only mono, instead of using the full 64 bit floats (AS3 Number) that you get from the microphone on the SAMPLE_DATA event. Simply read the Number and do myFloatSample * 0x7fff to convert to 16 bit signed integer.
Don't use the native 44.1kHz sampling rate if you're just recording speech or something else in that frequency range. You will likely get away just fine with 22.05kHz, which will cut the amount of data in half straight away. Just set the Microphone.rate property accordingly.
Don't use Base64 to encode your data. Send it as binary data, which will be significantly smaller. You can send it as raw POST data, or using something like AMF. Also, before you send it, use the native compress() or deflate() methods on the ByteArray to compress it. On the server, decompress using the ZLIB or raw DEFLATE (inflate) algorithms respectively, which PHP supports.
Once decompressed on the server, what you have is essentially what is called a raw 16-bit mono PCM stream. Incidentally, raw PCM is one of the input formats that ffmpeg (or lame) supports directly, so you should be able to encode it to mp3 without having to do any manual decoding first.
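A rough PHP sketch of that server-side step (the paths, sample rate and endianness are assumptions; an AS3 ByteArray defaults to big-endian, so use s16be instead of s16le if the client wrote samples with the default byte order):
<?php
// Decompress the posted ByteArray: gzuncompress() for ByteArray.compress(),
// gzinflate() for ByteArray.deflate().
$pcm = gzuncompress(file_get_contents('php://input'));
file_put_contents('/tmp/note.raw', $pcm);
// -f s16le: signed 16-bit little-endian PCM, -ar 22050: sample rate, -ac 1: mono
exec('ffmpeg -f s16le -ar 22050 -ac 1 -i /tmp/note.raw /tmp/note.mp3');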
Obviously the Red5 solution will likely be better, because it's more tailored for the task. But if you don't have the resources to set up a Red5 server, or don't want to use Java, the above solution is proven to work well as long as you stay away from too long recordings.
To take a simple example, a 30 second recording at 22,050 samples per second, 2 bytes per sample will be ~1.3MB. Even once deflated, the transfer to the server will likely still be almost a megabyte for 30 seconds of audio. This may or may not be acceptable for your application.

How to check image integrity?

I am building a web crawler, and one of its functions is to download images.
The problem is that sometimes, for some reason, there are images that are downloaded with errors in them, eg: Half of the image is plain gray or white, like it stopped downloading at some point, and then filled the void with gray. The image types are still considered valid, because I can get them with getimagesize, and also open and view them. But they are not like the originals.
Any ideas?
Compare response header Content-Length with actual number of bytes you received. There could be other reasons but I can't tell anything without seeing your code where you download that image.
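For instance, a minimal sketch of that check with cURL (assuming your crawler uses cURL; adapt if it doesn't): compare the announced Content-Length with the number of bytes actually received.
<?php
$ch = curl_init('http://example.com/image.jpg');   // illustrative URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data     = curl_exec($ch);
$expected = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);  // -1 if the header was missing
curl_close($ch);
if ($expected > 0 && strlen($data) < $expected) {
    // Incomplete download: retry or discard the image.
}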
I think this is a transmission interruption.
I see several possible cases: either your connection has been reset, in which case testing the socket state should enable you to diagnose the problem and re-initiate the download.
Or there is an undetected error during the transmission (though normally TCP/IP should deal with this), and/or you don't write all the downloaded data correctly: you think you read all the data from the socket, but read returns a smaller amount and you don't check the returned value against the intended size, so your image is not complete.
Usually half-grey images (especially JPEGs) are a sign of a file that is not complete: the headers are OK, so getimagesize has no problem, but the JPEG does not end with the 0xFF 0xD9 end-of-image marker. So check that you read all the data by comparing it with the size you were supposed to read.
Alternatively, you can write an image-format-dependent function to check the integrity of a file, for example by checking the markers within the JPEG. But that could be resource-consuming.
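A small sketch of that kind of format-specific check (the function name is mine): a complete JPEG must end with the 0xFF 0xD9 end-of-image marker.
<?php
function jpeg_looks_complete($path) {
    $f = fopen($path, 'rb');
    fseek($f, -2, SEEK_END);         // read the last two bytes of the file
    $tail = fread($f, 2);
    fclose($f);
    return $tail === "\xFF\xD9";     // EOI marker present?
}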
Just do an imagecreatefromstring() and check whether it returns false instead of an image resource.

Generating ZIP files with PHP + Apache on-the-fly in high speed?

To quote some famous words:
“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”
While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)
Suppose you have this big collection of JPG files and a few odd SWF files. By "big" I mean "a couple of thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there are a few new JPG files. The total size of all the stuff is thus around 1 GB, and is slowly but steadily increasing. Files are VERY rarely changed or deleted.
The users can view each of the files individually on the webpage. However, there is also a desire to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.
The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.
Since the amount of criteria is big enough, I cannot pre-generate all the possible ZIP files and must do it on-the-fly. Another problem is that the download can be quite large and for users with slow connections it's quite likely that it will take an hour or more. Support for "resume" is therefore a must-have.
On the bright side however the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.
The problems then that I have identified are thus:
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Will passing large amounts of file data through PHP not be a performance hit in itself?
How would you implement this? Is PHP up to the task at all?
Added:
By now, two people have suggested storing the requested ZIP files in a temporary folder and serving them from there as ordinary files. While this is indeed an obvious solution, there are several practical considerations which make it infeasible.
The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreds of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also there are many possible filter combinations, and many of them are likely to be selected by the users.
As a result, the ZIP files will be pretty slow to generate (due to sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.
This may be what you need:
http://pablotron.org/software/zipstream-php/
This lib allows you to build a dynamic streaming zip file without swapping to disk.
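A minimal usage sketch, based on the ZipStream-PHP 2.x API (the constructor and option handling differ in other major versions, so check the release you install; $filteredFiles is a placeholder for the result of the user's filter query):
<?php
require 'vendor/autoload.php';

// Stream a zip straight to the client: each JPEG is read, wrapped in zip
// headers and sent, without a temporary archive ever being written to disk.
$zip = new ZipStream\ZipStream('photos.zip');
foreach ($filteredFiles as $path) {
    $zip->addFileFromPath(basename($path), $path);
}
$zip->finish();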
Use e.g. the PhpConcept Library Zip library.
Resuming must be supported by your webserver, except in the case where you don't make the zip files accessible directly. If you have a PHP script as mediator, then pay attention to sending the right headers to support resuming.
The script creating the files shouldn't ever time out; just make sure the users can't select thousands of files at once. And keep something in place to remove "old" zip files, and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
You're going to have to store the generated zip file, if you want them to be able to resume downloads.
Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (hash of the search filters maybe). Then you send the correct headers to the user and echo file_get_contents to the user.
To support resuming you need to check the $_SERVER['HTTP_RANGE'] value; its format is detailed here, and once you've parsed that you'll need to run something like this.
$size = filesize($zip_file);
if (isset($_SERVER['HTTP_RANGE'])) {
    // parse the header, e.g. "bytes=1000-" or "bytes=1000-1999"
    $seek_range = substr($_SERVER['HTTP_RANGE'], strlen('bytes='));
    $range = explode('-', $seek_range);
    $start = (int) $range[0];
    $end   = ($range[1] !== '') ? (int) $range[1] : $size - 1;
    $new_length = $end - $start + 1;
    header("HTTP/1.1 206 Partial Content");
    header("Content-Length: $new_length");
    header("Content-Range: bytes $start-$end/$size");
    echo file_get_contents($zip_file, false, null, $start, $new_length);
} else {
    header("Content-Length: " . $size);
    echo file_get_contents($zip_file);
}
This is very sketchy code; you'll probably need to play around with the headers and the handling of the HTTP_RANGE value a bit. You can use fopen and fread rather than file_get_contents if you wish, and just fseek to the right place.
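Sketching out that fopen()/fseek() alternative (reusing the $zip_file, $start and $new_length variables from the snippet above), which avoids pulling the whole requested range into memory at once:
$fp = fopen($zip_file, 'rb');
fseek($fp, $start);                                 // jump to the start of the requested range
$remaining = $new_length;
while ($remaining > 0 && !feof($fp)) {
    $buf = fread($fp, min(8192, $remaining));       // serve the range in 8KB pieces
    echo $buf;
    $remaining -= strlen($buf);
}
fclose($fp);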
Now to your questions
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
You can remove it if you want to. However, if something goes pear-shaped and your code gets stuck in an infinite loop, it can lead to interesting problems, should that infinite loop be logging an error somewhere that you don't notice until a rather grumpy sys-admin wonders why their server ran out of hard disk space ;)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Caching the file to the hard disk means you won't have this problem.
Will passing large amounts of file data through PHP not be a performance hit in itself?
Yes, it won't be as fast as a regular download from the webserver, but it shouldn't be too slow.
I have a download page and made a zip class that is very similar to your ideas.
My downloads are very big files that can't be zipped properly with the zip classes out there.
I had similar ideas to yours.
The approach of giving up compression is very good. Not only do you need fewer CPU resources, you also save memory because you don't have to touch the input files and can pass them straight through; you can also calculate everything, like the zip headers and the final file size, very easily, and you can jump to any position and generate from that point to implement resuming.
I go even further: I generate one checksum from all the input files' CRCs and use it as an ETag for the generated file to support caching, and as part of the filename.
If you have already downloaded the generated zip file, the browser gets it from the local cache instead of from the server.
You can also adjust the download rate (for example 300KB/s).
You can add zip comments.
You can choose which files should be added and which not (for example Thumbs.db).
But there's one problem with the zip format that you can't completely overcome.
That's the generation of the CRC values.
Even if you use hash_file() to overcome the memory problem, or use hash_update() to generate the CRC incrementally, it will use too much CPU.
Not much for one person, but it's not recommended for professional use.
I solved this with an extra table of CRC values that I generate with a separate script.
I pass these CRC values as a parameter to the zip class.
With this, the class is ultra fast.
Like a regular download script, as you mentioned.
My zip class is a work in progress; you can have a look at it here: http://www.ranma.tv/zip-class.txt
I hope I can help someone with that :)
But I will discontinue this approach; I will rewrite my class as a tar class.
With tar I don't need to generate CRC values from the files; tar only needs some checksums for the headers, that's all.
And I don't need an extra MySQL table any more.
I think it makes the class easier to use if you don't have to create an extra CRC table for it.
It's not so hard, because tar's file structure is simpler than the zip structure.
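To make the tar point concrete, here is a rough PHP sketch of a ustar file header (not the answerer's actual class; the mode and ownership values are placeholders). The only checksum it needs covers the 512-byte header itself, so the file data can be streamed through untouched.
<?php
function tar_header($name, $size, $mtime) {
    $h  = str_pad($name, 100, "\0");                     // file name
    $h .= sprintf("%07o\0", 0644);                       // mode
    $h .= sprintf("%07o\0", 0) . sprintf("%07o\0", 0);   // uid, gid
    $h .= sprintf("%011o\0", $size);                     // file size in octal
    $h .= sprintf("%011o\0", $mtime);                    // modification time
    $h .= '        ';                                    // checksum field, blank for now
    $h .= '0';                                           // typeflag: regular file
    $h .= str_pad('', 100, "\0");                        // linkname
    $h .= "ustar\0" . '00';                              // magic + version
    $h  = str_pad($h, 512, "\0");                        // pad to a full 512-byte block
    $sum = array_sum(array_map('ord', str_split($h)));   // checksum over the header bytes
    return substr_replace($h, sprintf("%06o\0 ", $sum), 148, 8);
}
The file data then follows the header, padded to a multiple of 512 bytes, and the archive ends with two all-zero blocks.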
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
If your script is safe and it closes on user abort, then you can remove it completely.
But it would be safer if you just renew the timeout for every file that you pass through :)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Yes that would work.
I generated a checksum from the input files' CRCs.
I used it as an ETag and as part of the zip filename.
If something changed, the user can't resume the generated zip,
because the ETag and filename changed together with the content.
Will passing large amounts of file data through PHP not be a performance hit in itself?
No, if you only pass the data through, it will not use much more than a regular download.
Maybe 0.01% more, I don't know; it's not much :)
I assume that's because PHP doesn't do much with the data :)
You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided into chunks, instead of loading the entire content in PHP and then sending the zip file.
Both libraries are nice and useful pieces of code. A few details:
ZipStream "works" only with memory, but cannot be easily ported to PHP 4 if necessary (uses hash_file())
PHPZip writes temporary files on disk (consumes as much disk space as the biggest file to add in the zip), but can be easily adapted for PHP 4 if necessary.
