I noticed that PHP Imagick changes the IDAT chunks when processing PNGs.
How exactly is this done? Is it possible to create IDAT chunks that remain unchanged? Is it possible to predict Imagick's output?
Background information on this question:
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs:
$image = new Imagick('uploaded_file.png');
$image->stripImage();                     // removes comments and other metadata
$image->writeImage('secure_file.png');
Comments are stripped out, so the only way to bypass this filter is to hide the PHP payload in the IDAT chunk(s). As described here, this is theoretically possible, but Imagick somehow reinterprets the image data even if I set Compression and CompressionQuality to the values I used to create the PNG. I also managed to create a PNG whose zlib header remained unchanged by Imagick, but the raw compressed image data didn't. The only PNGs for which I got identical input and output were ones that had already been through Imagick. I also tried to find the reason for this in the source code, but couldn't locate it.
I'm aware that other checks are necessary to ensure the uploaded file actually is a PNG etc., and that PHP code in PNGs is no problem if the server is configured properly, but for now I'm just interested in this issue.
IDAT chunks can vary and still produce an identical image. The PNG spec unfortunately forces the IDAT chunks to form a single continuous data stream. What this means is that the data can be grouped/chunked differently, but when reassembled into a single stream it will be identical. Is the actual data different, or has just the "chunking" changed? If the latter, why does it matter, as long as the image is identical? PNG is a lossless type of compression; stripping the metadata, or even decompressing and recompressing the image, shouldn't change any pixel values.
If you're comparing the compressed data and expecting it to be identical, be aware that it can differ and still yield an identical image. This is because DEFLATE compression uses an iterative process to find the best matches in previous data. The higher the compression level you give it, the more it will search for matches and the smaller the output. With zlib, a level-9 deflate request will take a lot longer than the default and produce slightly smaller output.
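To see the effect concretely, here is a small PHP demonstration (a sketch using zlib's gzcompress() directly rather than PNG encoding; the sample data is arbitrary):

<?php
// The same input compressed at different zlib levels yields different bytes
// and sizes, yet decompresses to identical data.
$data = str_repeat('the quick brown fox jumps over the lazy dog ', 500);
$fast = gzcompress($data, 1);   // fastest, least searching for matches
$best = gzcompress($data, 9);   // slowest, smallest output
var_dump(strlen($fast), strlen($best));                 // different sizes
var_dump(gzuncompress($fast) === gzuncompress($best));  // bool(true)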
So, please answer the following questions:
1) Are you trying to compare the compressed data before/after your strip operation to see if somehow the image changed? If so, then looking at the compressed data is not the way to do it.
2) If you want to strip metadata without any other aspect of the image file changing, then you'll need to write the tool yourself. It's actually trivial to walk through PNG chunks and reassemble a new file while skipping the chunks you want to remove (see the sketch below).
Answer my questions and I'll update my answer with more details...
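To illustrate point 2, here is a minimal PHP sketch of that chunk walk. It assumes a well-formed PNG, keeps only the chunks listed in $keep, and copies them (IDAT included) byte for byte; a real tool would validate lengths and CRCs, and would probably keep ancillary chunks like tRNS as well.

<?php
// Copy a PNG while dropping every chunk not in $keep.
// Chunk layout: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
function strip_png_chunks(string $src, string $dst,
                          array $keep = ['IHDR', 'PLTE', 'IDAT', 'IEND']): void
{
    $in  = file_get_contents($src);
    $out = substr($in, 0, 8);                          // 8-byte PNG signature
    $pos = 8;
    while ($pos < strlen($in)) {
        $len  = unpack('N', substr($in, $pos, 4))[1];  // data length
        $type = substr($in, $pos + 4, 4);              // chunk type, e.g. "IDAT"
        if (in_array($type, $keep, true)) {
            $out .= substr($in, $pos, 12 + $len);      // copy untouched, CRC included
        }
        $pos += 12 + $len;
    }
    file_put_contents($dst, $out);
}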
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs
You should never need to think about this. If you are worried about people hiding webshells in a file that is uploaded to your server, you are doing something wrong.
For example, serving those files through the PHP parser, which is how a webshell gets invoked to attack a server.
From the Imagick readme file:
5) NEVER directly serve any files that have been uploaded by users directly through PHP, instead either serve them through the webserver, without invoking PHP, or use readfile to serve them within PHP.
readfile doesn't execute the file; it just sends it to the end user without invoking it, and so completely prevents the type of attack you seem to be concerned about.
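A minimal sketch of that approach (the path and MIME type are illustrative; real code would derive the type and validate the path):

<?php
$path = '/var/uploads/secure_file.png';        // stored outside the web root
header('Content-Type: image/png');
header('Content-Length: ' . filesize($path));
readfile($path);    // streams the bytes; any embedded PHP payload stays inert
exit;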
Related
I was wondering if you know of any good and accurate PHP library, or file I can include in my script, to analyse the content of a file and check whether it is of a specific type like .doc, .docx, .jpg, etc.
I know PHP offers a number of libraries we could use for this, but they're not very accurate; some just check the file extension or the file header (they can't even tell whether the file is broken or not).
What I'm asking for is something very accurate, simple and fast (I'm probably asking too much), but any link or suggestion will be accepted and appreciated. Thank you!
As far as I know, no such library exists; it also wouldn't make sense to have one.
Let's say I have a JPEG image I would like to analyse. The headers may be okay but the image itself broken, and when I want to convert or crop it for thumbnails (with the GD library, which is the one I use), the functions (mostly imagecreatefromjpeg) throw errors; to create a good thumbnail I need a valid image.
The best place to catch a malformed JPG file is when GD errors out while trying to process it. Just deal with that in a transparent and useful way (i.e. let the user know that something went wrong). Why add extra code that would essentially have to do the same thing?
By handling the error when it occurs, you can also catch issues that a simple analysis of the file wouldn't reveal anyway - for example, GD can't deal with CMYK JPGs. Still, CMYK JPGs are perfectly valid files. Another example is files that are too big to be processed on your server.
Of course, you can do header or size checks beforehand on every uploaded file. But a separate check that goes as deep as you would need doesn't make sense.
Apart from that, I would like to have it to prevent viruses or code injection...
This isn't a realistic goal. What if the library you open the file with to check it is vulnerable to the injection?
Also, injections like this are very rare; library vulnerabilities tend to be widely publicized, and patches quickly provided. Just keep your machine up to date.
If you really need enterprise-grade virus protection, get a server-side virus detection product.
What I did for this was to open the file, read it, and check the file's magic bytes; most of them are listed in each format's Wikipedia definition.
%PDF for PDF, the first 4 characters.
\x89PNG for PNG: the first byte is actually 0x89 (not a printable character), followed by "PNG".
I haven't yet seen a library that does this.
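A minimal PHP sketch of that approach; the signature map is illustrative and far from complete:

<?php
// Compare the first bytes of the file against known magic numbers.
function sniff_type(string $path): ?string
{
    $signatures = [
        'pdf' => '%PDF',
        'png' => "\x89PNG",        // PNG actually starts with byte 0x89, then "PNG"
        'gif' => 'GIF8',
        'jpg' => "\xFF\xD8\xFF",
    ];
    $head = (string) file_get_contents($path, false, null, 0, 8);
    foreach ($signatures as $type => $sig) {
        if (strncmp($head, $sig, strlen($sig)) === 0) {
            return $type;
        }
    }
    return null;                   // unknown format
}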
I have some PHP code that processes a number of PNG images, combining them pixel by pixel (so lots of imagecolorat calls). Some of these images can change, but a few are precalculated and rarely change.
The precalculated images are generated by GD and output in PHP using imagepng.
As they are read far more often than they are written, I'd like to optimize them for reading speed.
But which quality settings for imagepng best optimize reading performance in imagecreatefrompng?
Higher compression and filters create smaller files, but perhaps a bigger file with no compression or filters is faster to read?
Perhaps it's better to skip PNG files altogether and use raw, uncompressed binary files or something that can be read into a PHP array?
If you process almost the same files over and over again, you might want to stop tuning imagepng itself and move the logic one level higher: for example, cache the finished images or the function results (see, for example, Caching function results in PHP).
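As an illustration of that idea, here is a sketch that decodes the precalculated PNG once and caches the imagecolorat() results; the file names are assumptions:

<?php
// Rebuild the pixel cache only when the source PNG is newer than the cache.
$src   = 'precalculated.png';
$cache = 'precalculated.pixels.ser';
if (is_file($cache) && filemtime($cache) >= filemtime($src)) {
    $pixels = unserialize(file_get_contents($cache));
} else {
    $img    = imagecreatefrompng($src);
    $pixels = [];
    for ($y = 0, $h = imagesy($img); $y < $h; $y++) {
        for ($x = 0, $w = imagesx($img); $x < $w; $x++) {
            $pixels[$y][$x] = imagecolorat($img, $x, $y);   // decode once
        }
    }
    file_put_contents($cache, serialize($pixels));
}
// ...combine images by reading $pixels[$y][$x] instead of calling imagecolorat()...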
Would calling getimagesize() on a file and checking if the returned value differs from false suffice to determine whether or not a file is an image?
Are there any other ways to determine whether a file is an image in PHP, i.e. solutions that are more foolproof than simply checking the extension?
getimagesize() is a pretty reliable indication that the file is an image, yes.
It will determine if the image appears to have a valid header.
It (usually) won't determine whether there is any corruption in the actual image data, which may show up as a messed-up image or an error partway through loading the image.
You may also keep in mind that it is possible for a file to be a valid image file but also to conceal other data - either within metadata, image data, or after the end of the image data. So while getimagesize() may tell you you have a valid image, it doesn't necessarily mean the file isn't also valid as another type. Since JAR and ZIP files read from the end of the file, it's possible for a file to be both a valid image and a valid JAR/ZIP file, and JAR files are executable in a browser - the basis of the GIFAR exploit.
It would suffice to find out whether it's one of the supported file formats, yes. It actually parses the header bytes of the file, and is therefore very reliable.
It's the best method to use that is built into PHP.
Advanced tools like ImageMagick's identify command do essentially the same - consider them only if you need to support many more file formats than those supported by getimagesize() (their list is here, in the IMAGETYPE_* constants).
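In code, the check both answers describe looks roughly like this (the file name is an assumption):

<?php
$info = @getimagesize('uploaded_file');    // false if the header isn't a known image
if ($info === false) {
    exit('Not a recognized image format');
}
// $info[2] holds one of the IMAGETYPE_* constants, e.g. IMAGETYPE_PNG
echo image_type_to_mime_type($info[2]);    // e.g. "image/png"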
I am building a web crawler, and one of its functions is to download images.
The problem is that sometimes, for some reason, images are downloaded with errors in them, e.g. half of the image is plain gray or white, as if the download stopped at some point and the rest was filled with gray. The image types are still considered valid, because I can get them with getimagesize, and I can also open and view them. But they are not like the originals.
Any ideas?
Compare the response header Content-Length with the actual number of bytes you received. There could be other reasons, but I can't tell anything without seeing the code where you download the image.
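A sketch of that comparison using cURL (the URL is an assumption; a Content-Length of -1 means the server didn't send one):

<?php
$ch = curl_init('https://example.com/image.jpg');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body     = curl_exec($ch);
$expected = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);
if ($body === false || ($expected > 0 && strlen($body) != $expected)) {
    // truncated or failed download: retry instead of saving the image
}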
I think this is a transmission interruption.
I see several possible cases: either your connection has been reset, in which case testing the socket should let you diagnose the problem and restart the download.
Or there is an undetected error during transmission (though normally TCP/IP should deal with this), and/or you don't write all the downloaded data correctly: you think you read all the data on the socket, but read returns fewer bytes than requested, and since you don't check the returned value your image ends up incomplete.
Usually half-gray images (especially JPEGs) are a sign of an incomplete file: the headers are OK, so getimagesize has no problem, but the JPEG does not end with the 0xFF 0xD9 end-of-image marker. So check that you read all the data by comparing it with the size you were supposed to read.
You could also write an image-format-dependent function to check the integrity of the file, for example by checking the markers within the JPEG, though that could be resource-consuming.
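For JPEG, a minimal version of such a check is cheap: verify that the file ends with the 0xFF 0xD9 end-of-image marker mentioned above.

<?php
function jpeg_looks_complete(string $path): bool
{
    $f = fopen($path, 'rb');
    fseek($f, -2, SEEK_END);        // last two bytes of the file
    $tail = fread($f, 2);
    fclose($f);
    return $tail === "\xFF\xD9";    // JPEG end-of-image marker
}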
Just do an imagecreatefromstring() and check whether it returns false.
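For example (errors suppressed so corrupt data yields false rather than a warning):

<?php
$img = @imagecreatefromstring($downloadedBytes);   // $downloadedBytes: the raw download
if ($img === false) {
    // not decodable as an image: discard or re-download
}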
I'm creating something that includes a file upload service of sorts, and I need to store data compressed with zlib's compress() function. I send it across the internet already compressed, but I need to know the uncompressed file size on the remote server. Is there any way I can figure out this information without uncompress()ing the data on the server first, just for efficiency? That's how I'm doing it now, but if there's a shortcut I'd love to take it.
By the way, why is it called uncompress? That sounds pretty terrible to me, I always thought it would be decompress...
I doubt it. From memory, I don't believe the underlying zlib library provides this (although it's been a good 7 or 8 years since I used it, and the up-to-date docs don't seem to indicate this feature has been added).
One possibility would be to transfer another file which contained the uncompressed size (e.g., transfer both file.zip and file.zip.size) but that seems fraught with danger, especially if you get the size wrong.
Another alternative is, if the server uncompressing is time-expensive but doesn't have to be done immediately, to do it in a lower-priority background task (like with nice under Linux). But again, there may be drawbacks if the size checker starts running behind (too many uploads coming in).
And I tend to think of decompression in terms of "explosive decompression", not a good term to use :-)
If you're uploading using the raw 'compress' format, then you won't have information on the size of the data that's being uploaded. Pax is correct in this regard.
You can store it as a 4-byte header at the start of the compression buffer, assuming that the file size doesn't exceed 4 GB.
Some C code as an example:
uLongf compressedSize = compressBound(bufsize);   /* worst-case compressed size */
uint8_t *compressBuffer = malloc(sizeof (uLongf) + compressedSize);
*((uLongf *)compressBuffer) = (uLongf)bufsize;    /* header: the uncompressed size */
compress(compressBuffer + sizeof (uLongf), &compressedSize, sourceBuffer, bufsize);
Then you send the complete compressBuffer, of size compressedSize + sizeof (uLongf). When you receive it on the server side, you can use the following code to get the data back:
// data is in compressBuffer; assume you already know the compressed size
uLongf originalSize = *((uLongf *)compressBuffer);               // read the size header
uint8_t *realCompressBuffer = compressBuffer + sizeof (uLongf);  // compressed data follows
If you don't trust the client to send the correct size, then you will need to perform some sort of check of the uncompressed data on the server side. The suggestion of using uncompress to /dev/null is a reasonable one.
If you're uploading a .zip file, it contains a directory that tells you the size of each file when uncompressed. This information is built into the file format, though again, it is subject to malicious clients.
The zlib format doesn't have a field for the original input size, so I doubt you will be able to do this without simulating a decompression of the data. The gzip format has an "input size" (ISIZE) field that you could use, but maybe you want to avoid changing the compression format or having the clients send the file size.
But even if you use a different format, if you don't trust the clients you would still need to run a more expensive check to make sure the uncompressed data is the size the client claims. In that case, what you can do is make the uncompress-to-/dev/null process less expensive: make sure zlib doesn't write the output data anywhere, since you only want to know the uncompressed size.
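With PHP's incremental zlib API (PHP 7+), that decompress-and-discard pass can be written so the output is never accumulated; a sketch, assuming the data is in zlib format as produced by gzcompress():

<?php
// Measure the uncompressed size of a zlib stream without keeping the output.
function uncompressed_size(string $path): int
{
    $ctx  = inflate_init(ZLIB_ENCODING_DEFLATE);    // zlib (RFC 1950) format
    $size = 0;
    $in   = fopen($path, 'rb');
    while (($chunk = fread($in, 65536)) !== false && $chunk !== '') {
        $size += strlen(inflate_add($ctx, $chunk)); // measure, then discard
    }
    fclose($in);
    return $size;
}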