I am building a web crawler, and one of its functions is to download images.
The problem is that sometimes, for some reason, images are downloaded with errors in them. For example, half of the image is plain gray or white, as if the download stopped at some point and the rest was filled with gray. The files are still considered valid images, because getimagesize accepts them and I can open and view them, but they do not match the originals.
Any ideas?
Compare the Content-Length response header with the actual number of bytes you received. There could be other reasons, but I can't tell without seeing the code that downloads the image.
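A minimal sketch of that comparison in Python, purely as an illustration, since the question does not show the crawler's code or language. The helper name `body_matches_content_length` is hypothetical. It assumes the server actually sent a Content-Length header and did not use chunked transfer encoding or on-the-fly compression, in which case there is nothing reliable to compare against.

```python
from typing import Mapping, Optional


def body_matches_content_length(headers: Mapping[str, str], body: bytes) -> Optional[bool]:
    """Return True/False when the check is possible, None when no usable header exists."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    declared = normalised.get("content-length")
    if declared is None or not declared.strip().isdigit():
        return None  # e.g. chunked transfer encoding: nothing to compare against
    return int(declared) == len(body)
```

A `False` result means the download was cut short (or padded) and should be retried; `None` means this particular check cannot decide.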
I think this is a transmission interruption.
I see several cases. Either your connection has been reset, in which case checking the socket for errors should let you diagnose the problem and restart the download.
Or there is an undetected error during transmission (though TCP/IP should normally deal with this), and/or you are not writing all of the downloaded data correctly: you think you read all the data from the socket, but read returned fewer bytes than you asked for and you didn't check its return value against the expected size, so your image is incomplete.
Half-grey images (especially JPEG) are usually a sign of an incomplete file: the headers are OK, which is why getimagesize has no problem, but the JPEG does not end with 0xFF 0xD9. So check that you read all the data by comparing against the size you were supposed to read.
Optionally, you can write an image-format-dependent function to check the integrity of the file, for example by checking the markers within the JPEG. But it could be resource-consuming.
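As a rough illustration of the cheapest format-dependent check, here is a Python sketch that only looks at the JPEG start and end markers. The function name `jpeg_looks_complete` is made up for this example. It is strict by assumption: a file with extra bytes after the end marker would be flagged as incomplete, and a file that passes is not guaranteed to decode cleanly.

```python
def jpeg_looks_complete(data: bytes) -> bool:
    """Cheap structural check: SOI marker at the start, EOI marker at the very end."""
    soi = b"\xff\xd8"  # start-of-image marker
    eoi = b"\xff\xd9"  # end-of-image marker
    # A download that stopped early keeps the SOI and headers but loses the EOI.
    return len(data) >= 4 and data.startswith(soi) and data.endswith(eoi)
```

This reads only a few bytes per file, so it is cheap enough to run on every download before falling back to a full decode.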
Just call imagecreatefromstring() and check whether it returns false instead of an image.
I have a picture. For whatever reason, I need that picture to be sent to an environment that can only receive text and not images. Images and other files must be sent through their filter and I want to get around this. I calculated that there would be 480,000 independent hex values being manipulated, but this is really the only option I have. Also, is it possible to compress and uncompress it so that less data is sent? I will need to send the picture from a PHP web server [let's say, mysite.com/image.php] and receive it in Lua, and my only connection to the server is over a web request. No FTP, not even loading image files. Just setting 480,000 variables to the different IDs.
Oh, one more thing: it needs to not crash my server when I run it. ;)
Convert your image to base64 (e.g. so that you can pass it in a variable).
For example, I converted a PNG image.
The base64 image will look like this:
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAcAAAAHCAYAAADEUlfTAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAE9JREFUeNpiYMADGLEJKssrCACp+Uw4JPYD8QdGHBIP7j58EMgCFDAAcvqBOBGI64FYAMpmYIFqAilYD6Udgbo+IBvXAMT/gXg9sjUAAQYAG6IS47QjgzEAAAAASUVORK5CYII="
You can use it as an image source to display it.
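The question is about PHP and Lua, so the following Python is only a sketch of the idea, with hypothetical function names. It compresses the bytes before base64-encoding them, which assumes the receiving side has both a base64 decoder and a zlib-compatible inflate available. Note that base64 grows the data by roughly a third, and that compressing first helps mainly with raw pixel data; an already-compressed PNG or JPEG will barely shrink.

```python
import base64
import zlib


def encode_for_text_channel(raw: bytes) -> str:
    """Compress, then base64-encode, so the payload is plain ASCII text."""
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def decode_from_text_channel(text: str) -> bytes:
    """Reverse of encode_for_text_channel: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(text))
```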
Hope this helps!
I noticed that PHP Imagick changes the IDAT chunks when processing PNGs.
How exactly is this done? Is there a possibility to create IDAT chunks that remain unchanged? Is it possible to predict the outcome of Imagick?
Background information for this question:
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs:
$image = new Imagick('uploaded_file.png');
$image->stripImage();
$image->writeImage('secure_file.png');
Comments are stripped out, so the only way to bypass this filter is hiding the PHP payload in the IDAT chunk(s). As described here, it is theoretically possible but Imagick somehow reinterprets this Image data even if I set Compression and CompressionQuality to the values I used to create the PNG. I also managed to create a PNG whose ZLIB header remained unchanged by Imagick, but the raw compressed image data didn't. The only PNGs where I got identical input and output are the ones which went through Imagick before. I also tried to find the reason for this in the source code, but couldn't locate it.
I'm aware of the fact that other checks are necessary to ensure the uploaded file is actually a PNG etc. and PHP code in PNGs is no problem if the server is configured properly, but for now I'm just interested in this issue.
IDAT chunks can vary and still produce an identical image. The PNG spec unfortunately forces the IDAT chunks to form a single continuous data stream. What this means is that the data can be grouped/chunked differently, but when reassembled into a single stream it will be identical. Is the actual data different, or has just the "chunking" changed? If the latter, why does it matter if the image is identical? PNG uses lossless compression, so stripping the metadata and even decompressing and recompressing an image shouldn't change any pixel values.
If you're comparing the compressed data and expecting it to be identical, it can be different and still yield an identical image. This is because FLATE compression uses an iterative process to find the best matches in previous data. The higher the "quality" number you give it, the more it will search for matches and shrink the output data size. With zlib, a level 9 deflate request will take a lot longer than the default and result in slightly smaller output data size.
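A small Python demonstration of this point, using the standard `zlib` module. The `pixels` buffer is a made-up stand-in for PNG scanline data, not real image content.

```python
import zlib

# Hypothetical stand-in for PNG scanline data: repetitive, so it compresses well.
pixels = bytes(range(256)) * 64

fast = zlib.compress(pixels, 1)  # level 1: quick, shallow search for matches
best = zlib.compress(pixels, 9)  # level 9: slower, more thorough search

# The compressed byte streams differ, yet both inflate to the same data.
streams_differ = fast != best
same_pixels = zlib.decompress(fast) == zlib.decompress(best) == pixels
```

So a byte-for-byte comparison of IDAT payloads says nothing about whether the pixels changed; only the inflated data can answer that.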
So, please answer the following questions:
1) Are you trying to compare the compressed data before/after your strip operation to see if somehow the image changed? If so, then looking at the compressed data is not the way to do it.
2) If you want to strip metadata without any other aspect of the image file changing then you'll need to write the tool yourself. It's actually trivial to walk through PNG chunks and reassemble a new file while skipping the chunks you want to remove.
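A sketch of such a chunk walker in Python (the thread is about PHP, so this only illustrates the technique). The names `strip_png_chunks`, `make_chunk` and `KEEP` are invented for this example. As an assumption, it keeps only the four critical chunk types; dropping ancillary chunks such as tRNS or gAMA can change how the image renders, so adjust the set as needed. Kept chunks, including IDAT, are copied byte for byte, which also means anything hidden inside IDAT survives.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Assumption: keep only the critical chunks; every ancillary chunk is dropped.
KEEP = {b"IHDR", b"PLTE", b"IDAT", b"IEND"}


def strip_png_chunks(png: bytes, keep=KEEP) -> bytes:
    """Copy the chunks listed in `keep` unchanged and skip everything else."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    out = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(png):
        # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        end = pos + 12 + length
        if end > len(png):
            raise ValueError("truncated chunk")
        if chunk_type in keep:
            out.append(png[pos:end])
        pos = end
        if chunk_type == b"IEND":
            break
    return b"".join(out)


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Helper for building test PNGs: the CRC covers the type and the data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)
```

Because no chunk is re-encoded, the existing CRCs stay valid and the output differs from the input only by the missing chunks.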
Answer my questions and I'll update my answer with more details...
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs
You should never need to think about this. If you are worried about people hiding webshells in a file that is uploaded to your server, you are doing something wrong.
For example, serving those files through the PHP parser, which is the way a webshell could be invoked to attack a server.
From the Imagick readme file:
5) NEVER directly serve any files that have been uploaded by users directly through PHP, instead either serve them through the webserver, without invoking PHP, or use readfile to serve them within PHP.
readfile doesn't execute the file, it just sends it to the end-user without invoking it, and so completely prevents the type of attack you seem to be concerned about.
Would calling getimagesize() on a file and checking if the returned value differs from false suffice to determine whether or not a file is an image?
Are there any other ways to determine whether a file is an image in PHP, ones that are more foolproof than simply checking the extension?
getimagesize() is a pretty reliable indication that the file is an image, yes.
It will determine if the image appears to have a valid header.
It (usually) won't determine if there is any corruption in the actual image data, which may show up as a messed up image or an error part way through loading the image.
You may also keep in mind that it is possible for a file to be a valid image file but also to conceal other data - either within metadata, image data, or after the end of the image data. So while getimagesize() may tell you you have a valid image, it doesn't necessarily mean the file isn't also valid as another type. Since JAR and ZIP files read from the end of the file, it's possible for a file to be both a valid image and a valid JAR/ZIP file, and JAR files are executable in a browser - the basis of the GIFAR exploit.
It would suffice to find out whether it's one of the supported file formats, yes. It actually parses the header bytes of the file, and is therefore very reliable.
It's the best method to use that is built into PHP.
Advanced tools like ImageMagick's identify command do essentially the same - consider them only if you need to support many more file formats than those supported by getimagesize() (their list is here, in the IMAGETYPE_* constants).
Okay. So I have about 250,000 high resolution images. What I want to do is go through all of them and find ones that are corrupted. If you know what 4scrape is, then you know the nature of the images I.
Corrupted, to me, is the image is loaded into Firefox and it says
The image “such and such image” cannot be displayed, because it contains errors.
Now, I could select all of my 250,000 images (~150gb) and drag-n-drop them into Firefox. That would be bad though, because I don't think Mozilla designed Firefox to open 250,000 tabs. No, I need a way to programmatically check whether an image is corrupted.
Does anyone know a PHP or Python library which can do something along these lines? Or an existing piece of software for Windows?
I have already removed obviously corrupted images (such as ones that are 0 bytes) but I'm about 99.9% sure that there are more diseased images floating around in my throng of a collection.
An easy way would be to try loading and verifying the files with PIL (Python Imaging Library).
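from PIL import Image

# `path` is a placeholder for the file you want to check.
try:
    with Image.open(path) as v_image:
        v_image.verify()  # raises an exception if the file looks broken
except Exception as exc:
    print(f"{path} appears to be corrupted: {exc}")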
From the documentation:
im.verify()
Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.
I suggest you check out ImageMagick for this: http://www.imagemagick.org/
It includes a tool called identify, which you can either use in combination with a script/stdout, or you can use the programming interface provided.
In PHP, with exif_imagetype():
if (exif_imagetype($filename) === false)
{
    unlink($filename); // signature is not a recognized image type
}
EDIT: Or you can try to fully load the image with ImageCreateFromString():
if (ImageCreateFromString(file_get_contents($filename)) === false)
{
    unlink($filename); // image is corrupted
}
An image resource will be returned on success. FALSE is returned if the image type is unsupported, the data is not in a recognized format, or the image is corrupt and cannot be loaded.
If your exact requirement is that it shows correctly in Firefox, you may have a difficult time: the only way to be sure would be to link to the exact same image-loading source code as Firefox.
Basic image corruption (the file is incomplete) can be detected simply by trying to open the file using any number of image libraries.
However, many images can fail to display simply because they stretch a part of the file format that the particular viewer you are using can't handle (GIF in particular has a lot of these edge cases, but you can find JPEG and the rare PNG file that can only be displayed in specific viewers). There are also some ugly JPEG edge cases where the file appears to be uncorrupted in viewer X, but in reality the file has been cut short and is only displaying correctly because very little information has been lost. Firefox can show some cut-off JPEGs correctly (you get a grey bottom), but for others Firefox seems to load them halfway and then displays the error message instead of the partial image.
You could use imagemagick if it is available:
if you want to do a whole folder
identify "./myfolder/*" >log.txt 2>&1
if you want to just check a file:
identify myfile.jpg
I'm creating something that includes a file upload service of sorts, and I need to store data compressed with zlib's compress() function. I send it across the internet already compressed, but I need to know the uncompressed file size on the remote server. Is there any way I can figure out this information without uncompress()ing the data on the server first, just for efficiency? That's how I'm doing it now, but if there's a shortcut I'd love to take it.
By the way, why is it called uncompress? That sounds pretty terrible to me, I always thought it would be decompress...
I doubt it. I don't believe this is something the underlying zlib libraries provide from memory (although it's been a good 7 or 8 years since I used it, the up-to-date docs don't seem to indicate this feature has been added).
One possibility would be to transfer another file which contained the uncompressed size (e.g., transfer both file.zip and file.zip.size) but that seems fraught with danger, especially if you get the size wrong.
Another alternative is, if the server uncompressing is time-expensive but doesn't have to be done immediately, to do it in a lower-priority background task (like with nice under Linux). But again, there may be drawbacks if the size checker starts running behind (too many uploads coming in).
And I tend to think of decompression in terms of "explosive decompression", not a good term to use :-)
If you're uploading using the raw 'compress' format, then you won't have information on the size of the data that's being uploaded. Pax is correct in this regard.
You can store it as a 4 byte header at the start of the compression buffer - assuming that the file size doesn't exceed 4GB.
some C code as an example:
// Reserve room for the 4-byte size header plus the worst-case compressed output.
uLongf compressedSize = compressBound(bufsize);
uint8_t *compressBuffer = malloc(sizeof (uint32_t) + compressedSize);
uint32_t sizeHeader = (uint32_t)filesize;  // host byte order; both ends must agree
memcpy(compressBuffer, &sizeHeader, sizeof sizeHeader);  // needs <string.h>
compress(compressBuffer + sizeof (uint32_t), &compressedSize, sourceBuffer, bufsize);
Then you send the complete compressBuffer, whose size is compressedSize + sizeof (uint32_t). When you receive it on the server side you can use the following code to get the data back:
// data is in compressBuffer, assume you already know compressed size.
uint32_t originalSize;
memcpy(&originalSize, compressBuffer, sizeof originalSize);
uint8_t *realCompressBuffer = compressBuffer + sizeof (uint32_t);
If you don't trust the client to send the correct size, then you will need to perform some sort of uncompressed data check on the server side. The suggestion of using uncompress to /dev/null is a reasonable one.
If you're uploading a .zip file, it contains a directory which tells you the size of the file when it's uncompressed. This information is built into the file format, again, though this is subject to malicious clients.
The zlib format doesn't have a field for the original input size, so I doubt you will be able to do that without simulating a decompression of the data. The gzip format has an "input size" (ISIZE) field that you could use, but maybe you want to avoid changing the compression format or having the clients send the file size.
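For illustration, a Python sketch of reading ISIZE from a gzip stream (the question itself uses zlib's C API, so this is not a drop-in answer). The helper name `gzip_declared_size` is hypothetical. It assumes a single gzip member; ISIZE is stored modulo 2^32, and since the sender writes it, it is no more trustworthy than any other client-supplied size.

```python
import gzip
import struct


def gzip_declared_size(blob: bytes) -> int:
    """Read ISIZE: the last 4 bytes of a gzip member, little-endian, size mod 2**32."""
    # 10-byte header plus 8-byte trailer is the bare minimum for a gzip member.
    if len(blob) < 18 or blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    (isize,) = struct.unpack("<I", blob[-4:])
    return isize
```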
But even if you use a different format, if you don't trust the clients you would still need to run a more expensive check to make sure the uncompressed data is the size the client says it is. In this case, what you can do is to make the uncompress-to-/dev/null process less expensive, making sure zlib doesn't write the output data anywhere, as you just want to know the uncompressed size.
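One way to make that check cheaper, sketched in Python with the standard `zlib` module: inflate the stream in bounded pieces and keep only a running count, so the output is never stored. The function name `uncompressed_size` and its `limit` and `chunk` parameters are assumptions for this example; `limit` lets the server stop early on a decompression bomb.

```python
import zlib


def uncompressed_size(blob: bytes, limit: int, chunk: int = 64 * 1024) -> int:
    """Inflate `blob` piece by piece, keeping only a running byte count."""
    inflater = zlib.decompressobj()
    total = 0
    pending = blob
    while not inflater.eof:
        # The second argument caps each output piece, so memory use stays bounded.
        piece = inflater.decompress(pending, chunk)
        pending = inflater.unconsumed_tail
        total += len(piece)
        if total > limit:
            raise ValueError("uncompressed data exceeds the allowed size")
        if not piece and not pending and not inflater.eof:
            raise ValueError("compressed stream is truncated")
    return total
```

For example, `uncompressed_size(data, limit=50 * 1024 * 1024)` would reject anything that inflates beyond 50 MB without ever holding more than one 64 KB piece of output in memory.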