Does this method compare the pixel values of the images? I'm guessing it won't work when they are different sizes, but what if the images are identical and only the file formats differ? For example, I took one screenshot and saved it as a .jpg, and took another and saved it as a .gif.
An MD5 hash is computed over the file's actual binary data, and different formats produce completely different binary data,
so for two MD5 hashes to match, the files must be identical byte for byte. (There are exceptions in fringe cases, such as deliberately constructed collisions.)
This is actually one way forensic law enforcement identifies known contraband images.
It is an MD5 checksum - the same thing you often see when downloading a file: if the MD5 of the downloaded file matches the MD5 given by the provider, the file transfer was successful. See http://en.wikipedia.org/wiki/Checksum. If there is even one bit of difference between the two files, the resulting hashes will be completely different.
Due to the difference in encoding between a JPG and a GIF, the two will not have the same MD5 hash.
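For example (a minimal sketch; the file names are just placeholders for the two screenshots mentioned in the question):

$jpgHash = md5_file('screenshot.jpg');
$gifHash = md5_file('screenshot.gif');

// Same picture on screen, but the bytes on disk differ,
// so the hashes differ as well.
var_dump($jpgHash === $gifHash); // bool(false)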
MD5 is a hash algorithm, so it does not compare images; it compares data. The input can be nearly anything, such as the contents of a file, and the output is a hash string derived from that raw data.
So when you feed an image into MD5, you are not comparing pictures but the raw bytes of the files. The hash algorithm knows nothing beyond those bytes, so a JPG and a GIF (or any other image format) of the same screenshot will never hash to the same value.
Even if you compare the decoded images, the hashes will not match, because lossy compression introduces small differences the human eye cannot see (depending on the amount of compression used). This might be different when comparing the decoded data of losslessly encoded images, but I'm not sure.
Take a look at the Wikipedia article on hash functions for a more detailed explanation and technical background.
Look at the raw bytes: a .jpg file has 'JFIF' within its first few bytes, while a .gif starts with 'GIF'. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ - even if the actual image is the "same picture".
To do a hash-based image comparison, you have to compare two images in the same format. Even then, it would be very, very difficult to produce a .jpg and a .gif of the same image that compare equal after converting both to (say) .bmp. They would share a file format, but the internal constraints of .gif (8-bit colour, lossless LZW/RLE compression) versus those of .jpg (24-bit colour, lossy discrete cosine transform compression) make it nigh-on impossible to get the same .bmp from both source images.
If you're comparing hashes, then every single byte of the two images has to match - they can't use different compression formats, or merely "look the same". They have to be identical.
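If you want to see the format signatures for yourself, here is a quick sketch (the helper name and file names are made up):

// Read the first few bytes of a file and show them as hex.
function fileSignature($path, $length = 12) {
    $handle = fopen($path, 'rb');
    $bytes  = fread($handle, $length);
    fclose($handle);
    return bin2hex($bytes);
}

echo fileSignature('shot.jpg'), "\n"; // begins ffd8ff..., with "JFIF" a few bytes in
echo fileSignature('shot.gif'), "\n"; // begins 474946... ("GIF")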
MD5 is a hash: a code calculated from a bunch of data - any data, really.
MD5 is certainly not collision-free, but the chance that two different images produce exactly the same code is quite small. Therefore you could compare images by calculating an MD5 code for each of them and comparing the codes.
You cannot compare images using the MD5 sum, as all the other posters have noted. However, you can compare the images in a different way that will tell you their similarity regardless of image format or even size: you can use libPuzzle.
http://libpuzzle.pureftpd.org/project/libpuzzle
This is a great library for image comparison and works very well.
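If the libPuzzle PHP extension is installed, usage is roughly like this (a sketch based on the project's documentation; the function and constant names below should be checked against your installed version, and the file names are placeholders):

// Build a signature vector for each image, then measure how far apart they are.
$sig1 = puzzle_fill_cvec_from_file('logo_a.jpg');
$sig2 = puzzle_fill_cvec_from_file('logo_b.gif');

$distance = puzzle_vector_normalized_distance($sig1, $sig2);

if ($distance < PUZZLE_CVEC_SIMILARITY_THRESHOLD) {
    echo 'The pictures look similar';
}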
It will still not work. Any image file contains a header portion and the binary image buffer. In the scenario you describe:
1. The headers differ between .jpg and .gif, which by itself results in a different MD5 sum.
2. The image buffers themselves may differ because of the compression used by, say, the .jpg format.
md5sum is a tool used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change.
Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious meddling. The md5sum program is included in most Unix-like operating systems or compatibility layers such as Cygwin.
Hence it can only tell you whether two files are byte-for-byte identical; it cannot tell you whether two different files contain the same picture.
Running md5sum on the images will generate hashes of their raw data. The resulting hash strings will not be the same, because the image formats (GIF and JPEG) are not the same.
In addition, the file sizes will not match either - a GIF version is often larger than the JPEG - so the files' contents, and therefore their MD5 hashes, cannot tally at all.
If I have multiple zip files and I loop through the contents of each to find unique files, will the CRC value be the same for the same file in different zips?
The statIndex method on ZipArchive returns an array like this:
Array
(
[name] => foobar/baz
[index] => 3
[crc] => 499465816
[size] => 27
[mtime] => 1123164748
[comp_size] => 24
[comp_method] => 8
)
To be honest, the filesize will probably be unique enough for my needs, but to be safe I was looking for another way to detect uniqueness.
From what I can tell, the only alternative would be to extract each file and then use a file-hash method, but this would be a lot slower than just using something that's already made available by the ZipArchive class.
In my case I have a directory of about 230,000 images built from 30,000 zips with around 30 images in each zip and I want to create a database of which images came from which zip, and I know that there will be lots of duplicates.
No, a 32-bit CRC collides too easily. Consider comparing the CRC and the size (and preferably also the compressed size and the compression method) - if all four are the same, it's safe enough to assume identical files.
However, what's your definition of "duplicate"?
Two picture files can have an identical payload (the actual photo) but different metadata (caption, comment...) - in that case you'd hash only portions of the files yourself, so the metadata is ignored.
Two picture files can portray the same scene, but have different dimensions (e.g. 800x600 versus 1600x1200) or different compression (lossy, lossless, interlaced...) - in that case you have to interpret them visually.
Two picture files can render to the same display, but have different formats (e.g. PNG, TIFF, JPEG, WEBP...) - in that case you want to compare their rendered bitmaps.
As you can see, extracting/uncompressing the files would let you work more precisely, first of all by running your favorite duplicate-detection software on them.
A two-stage approach
The CRC is, as far as I can tell, a 32-bit unsigned integer (2^32 = 4,294,967,296 possible values). For bigger files like images, we can assume it has a roughly uniform distribution. I would combine it with the size to get a hopefully unique string:
$stat = $zip->statIndex($index);
$str = $stat["crc"] . $stat["size"];
If the compression method is the same in all ZIP files you could add the compressed size:
$stat = $zip->statIndex($index);
$str = $stat["crc"] . $stat["size"] . $stat["comp_size"];
That would make it highly unlikely that two different images result in the same string, but just like with real hashes there is still a very small chance that it will return the same string for two different images.
I don't think that is acceptable.
However, if two images return the same string, you can still inspect them more closely to check whether they really are the same. You could start with one of the stronger hashes, but why not simply do a byte-by-byte comparison? This way you can be absolutely sure about the uniqueness of your images.
Sure, this will be slower than just relying on the stats, but I think you have to agree that this is better than having even a very small chance of misidentifying images.
So my approach here would be to do a rough check with the CRC and size first. If those match, I would actually compare the files to make sure they really are the same. This way I never run the risk of assuming two images are the same, because their CRC/sizes match, when they are not.
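A rough sketch of that two-stage check (the helper names and the key format are just illustrative):

// Stage 1: a cheap key built from the ZIP entry's stats.
function roughKey(ZipArchive $zip, $index) {
    $stat = $zip->statIndex($index);
    return $stat['crc'] . ':' . $stat['size'] . ':' . $stat['comp_size'];
}

// Stage 2: only when two entries share a key, pull out the bytes and compare them.
function sameBytes(ZipArchive $zipA, $indexA, ZipArchive $zipB, $indexB) {
    return $zipA->getFromIndex($indexA) === $zipB->getFromIndex($indexB);
}

Group entries by roughKey() first; only the rare key collisions ever reach sameBytes(), so the expensive extraction almost never runs.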
I am writing a PHP script to verify that the JPEG data in two files are identical. The EXIF/IPTC (metadata) may change between the two files.
My general approach is to use an MD5 hash to compare the binary JPEG data of the two files to confirm it's unchanged.
However, no matter what I do using GD, I seem to be getting an MD5 hash of BOTH the metadata and JPEG data. Does anyone know the best method to extract just the image data from a JPEG file using PHP?
Thanks in advance...
#jarek.d above suggested using mogrify (part of ImageMagick), so I am using exec to strip the metadata before comparing the two files. This works well.
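In case it helps anyone else, this is roughly what it ended up looking like (paths are placeholders; the stripping is done on copies so the originals keep their metadata):

// Copy both files, strip metadata from the copies with ImageMagick's mogrify,
// then hash what is left.
copy('original_a.jpg', '/tmp/a.jpg');
copy('original_b.jpg', '/tmp/b.jpg');
exec('mogrify -strip /tmp/a.jpg /tmp/b.jpg');

var_dump(md5_file('/tmp/a.jpg') === md5_file('/tmp/b.jpg'));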
I noticed that PHP Imagick changes the IDAT chunks when processing PNGs.
How exactly is this done? Is there a way to create IDAT chunks that remain unchanged? Is it possible to predict Imagick's output?
Background information to this question:
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs:
$image = new Imagick('uploaded_file.png');
$image->stripImage();
$image->writeImage('secure_file.png');
Comments are stripped out, so the only way to bypass this filter is hiding the PHP payload in the IDAT chunk(s). As described here, it is theoretically possible but Imagick somehow reinterprets this Image data even if I set Compression and CompressionQuality to the values I used to create the PNG. I also managed to create a PNG whose ZLIB header remained unchanged by Imagick, but the raw compressed image data didn't. The only PNGs where I got identical input and output are the ones which went through Imagick before. I also tried to find the reason for this in the source code, but couldn't locate it.
I'm aware of the fact that other checks are necessary to ensure the uploaded file is actually a PNG etc. and PHP code in PNGs is no problem if the server is configured properly, but for now I'm just interested in this issue.
IDAT chunks can vary and still produce an identical image. The PNG spec unfortunately forces the IDAT chunks to form a single continuous data stream. What this means is that the data can be grouped/chunked differently, but when re-assembled into a single stream it will be identical. Is the actual data different, or is just the "chunking" changed? If the latter, why does it matter, if the image is identical? PNG is a lossless type of compression, so stripping the metadata and even decompressing+recompressing an image shouldn't change any pixel values.
If you're comparing the compressed data and expecting it to be identical, it can be different and still yield an identical image. This is because DEFLATE compression uses an iterative process to find the best matches in previous data. The higher the "quality" number you give it, the more it will search for matches and shrink the output data size. With zlib, a level 9 deflate request will take a lot longer than the default and result in slightly smaller output.
So, please answer the following questions:
1) Are you trying to compare the compressed data before/after your strip operation to see if somehow the image changed? If so, then looking at the compressed data is not the way to do it.
2) If you want to strip metadata without any other aspect of the image file changing, then you'll need to write the tool yourself. It's actually trivial to walk through the PNG chunks and reassemble a new file while skipping the chunks you want to remove (see the sketch at the end of this answer).
Answer my questions and I'll update my answer with more details...
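For what it's worth, a bare-bones sketch of such a chunk walker (the function name and the list of "metadata" chunks are my own choices; it assumes a well-formed PNG and does no validation):

// Copy every chunk except the metadata ones, leaving IDAT byte-for-byte untouched.
function stripPngMetadata($in, $out) {
    $data = file_get_contents($in);
    $keep = substr($data, 0, 8);                   // 8-byte PNG signature
    $pos  = 8;
    $drop = array('tEXt', 'zTXt', 'iTXt', 'tIME'); // chunks treated as metadata here

    while ($pos < strlen($data)) {
        $len   = unpack('N', substr($data, $pos, 4)); // big-endian data length
        $len   = $len[1];
        $type  = substr($data, $pos + 4, 4);
        $chunk = substr($data, $pos, 12 + $len);      // length + type + data + CRC
        if (!in_array($type, $drop, true)) {
            $keep .= $chunk;
        }
        $pos += 12 + $len;
    }
    file_put_contents($out, $keep);
}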
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs
You should never need to think about this. If you are worried about people hiding webshells in a file that is uploaded to your server, you are doing something wrong.
For example, serving those files through the PHP parser - which is how a webshell would have to be invoked to attack a server.
From the Imagick readme file:
5) NEVER directly serve any files that have been uploaded by users directly through PHP, instead either serve them through the webserver, without invoking PHP, or use readfile to serve them within PHP.
readfile doesn't execute the file, it just sends it to the end-user without invoking it, and so completely prevents the type of attack you seem to be concerned about.
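A minimal sketch of what that looks like in practice (the path and MIME type are placeholders):

// Send the stored file to the browser without ever running it through the PHP parser.
$path = '/var/uploads/secure_file.png';
header('Content-Type: image/png');
header('Content-Length: ' . filesize($path));
readfile($path);
exit;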
I have over 1.3 million images that I have to compare with each other, and a few hundred are added every day.
My company takes an image and creates a version that can be utilized by our vendors.
The files are often very similar to each other; for example, two different companies can send us two different images, a JPG and a GIF, both with the McDonald's logo, with months between the submissions.
What happens is that in the end we create the same logo twice, when we could simply copy the already created one, or at least suggest it to the artists as a possible starting point.
I have looked around for algorithms to create a fingerprint or something that would allow me to do a simple query when a new image is uploaded. Time is relatively not an issue: if it takes 1 second to create each fingerprint, it will take roughly two weeks to fingerprint them all, but it would be such a big saving that we might even dedicate 3 or 4 servers to it.
I am fluent in PHP, but if the algorithm is in pseudocode or even C, I can read it and try to translate it (unless it uses some C-specific libraries).
Currently I run an MD5 over all the images to catch the ones that are exactly the same. This question came up when I was thinking about resizing each image and running the MD5 on the resized version, to catch the ones that have been saved in a different format and resized, but then I would still not have good enough recognition.
If I didn't mention it: I would be happy with something that just suggests possible "similar" images.
EDIT
Keep in mind that the check needs to be done multiple times per minute, so the best solution is one that gives me some values per image that I can store and later use to compare with the image I am looking at, without having to re-scan the whole server.
I am reading some pages that mention histograms, or resizing the image to a very small size, stripping possible tags, converting it to grayscale, hashing that, and using the hash for comparison. If I am successful I will post the code/answer here.
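A rough sketch of that shrink-and-hash idea using GD (just one common variant, an 8x8 "average hash", not the exact recipe from those pages):

// Shrink to 8x8, convert to grayscale, and build a 64-bit fingerprint:
// one bit per pixel saying whether it is brighter than the average.
function averageHash($path) {
    $src = imagecreatefromstring(file_get_contents($path));
    $tmp = imagecreatetruecolor(8, 8);
    imagecopyresampled($tmp, $src, 0, 0, 0, 0, 8, 8, imagesx($src), imagesy($src));
    imagefilter($tmp, IMG_FILTER_GRAYSCALE);

    $pixels = array();
    for ($y = 0; $y < 8; $y++) {
        for ($x = 0; $x < 8; $x++) {
            $pixels[] = imagecolorat($tmp, $x, $y) & 0xFF; // gray level after the filter
        }
    }
    $avg = array_sum($pixels) / 64;

    $bits = '';
    foreach ($pixels as $p) {
        $bits .= ($p > $avg) ? '1' : '0';
    }
    return $bits; // store this per image; compare fingerprints by Hamming distance
}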
Try using hash_file (or file_get_contents plus hash):
http://www.php.net/manual/en/function.hash-file.php
If the hashes match, then you know the files are exactly the same.
EDIT:
If possible, I would think storing the image hashes and the image paths in a database table might help you limit server load. It is much easier to run the hash algorithm once on your initial images and store the hashes in a table. Then, when new images are submitted, you can hash the image and do a lookup on the database table; if the hash is already there, discard it. You can use the hash as the table index, so once you find a match you don't need to check the rest.
The other option is to not use a database, but then you would always have to do an O(n) lookup: hash the incoming image and then search, in memory, against all n saved images.
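Something along these lines for the database route (the table, column and variable names are made up):

// Hash the incoming file and look it up; insert it only if it is new.
$hash = hash_file('md5', $uploadedPath);

$stmt = $pdo->prepare('SELECT path FROM images WHERE hash = ?');
$stmt->execute(array($hash));

if ($existingPath = $stmt->fetchColumn()) {
    // Exact duplicate already stored at $existingPath - discard the upload.
} else {
    $pdo->prepare('INSERT INTO images (hash, path) VALUES (?, ?)')
        ->execute(array($hash, $storedPath));
}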
EDIT #2:
Please view the solution here: Image comparison - fast algorithm
To speed up the process, sort all the files by size and compare contents only if two sizes are equal. To compare the contents, a hash comparison is also the fastest way (see the sketch below). Hope this helps.
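Something like this, for example (the directory path is a placeholder):

// Bucket files by size first; only files sharing a size get hashed and compared.
$bySize = array();
foreach (glob('/path/to/images/*') as $file) {
    $bySize[filesize($file)][] = $file;
}

foreach ($bySize as $size => $files) {
    if (count($files) < 2) {
        continue;                              // unique size, nothing to compare
    }
    $byHash = array();
    foreach ($files as $file) {
        $byHash[md5_file($file)][] = $file;    // only same-size files get hashed
    }
    foreach ($byHash as $dupes) {
        if (count($dupes) > 1) {
            // $dupes holds files with identical contents
        }
    }
}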
Users are uploading photos to our PHP-based system. Some of them we mark as forbidden because of irrelevant content. I'm looking to optimize an 'auto-compare' algorithm that skips these photos marked as forbidden. Every upload needs to be compared against many forbidden ones.
Possible solutions:
1/ Store forbidden files and compare whole content - works well but is slow.
2/ Store image file checksums and compare the checksums - this is the idea for improving the speed.
3/ Any intelligent algorithm that is fast enough and can compare similarity between photos. But I don't have any ideas about these in PHP.
What is the best solution?
Don't calculate checksums, calculate hashes!
I once created a simple application that had to look for duplicate images on my hard disk. It would only search for .JPG files, but for every file I would calculate a hash value over the first 1024 bytes, then append the width, height and size of the image to get a string like "875234:640:480:13286", which I would use as the key for the image.
As it turns out, I haven't seen any false duplicates with this algorithm, although there still is a chance of false duplicates.
However, this scheme will let duplicates slip through when someone adds just one byte to a file, or makes a very small adjustment to the image.
Another trick could be to reduce the size and number of colors of every image. If you resize every image to 128x128 pixels and reduce the number of colors to 16 (4 bits), you end up with reasonably unique patterns of 8192 bytes each. Calculate a hash value over this pattern and use the hash as the primary key. Once you get a hit, you might still have a false positive, so you would need to compare the pattern of the new image with the pattern stored in your system.
This pattern comparison could be used when the first hash solution indicates that the new image is unique. It's something that I still need to work out for my own tool, though. But it's basically a kind of fingerprinting of images and then comparing the fingerprints.
My first solution will find exact matches. My second solution will find similar images. (By the way, I wrote my hash method in Delphi, but technically any hash method would be good enough.)
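In PHP, the first scheme might look something like this (the function name is made up, and any fast hash would do in place of crc32):

// Key = hash of the first 1024 bytes + width + height + file size.
function quickImageKey($path) {
    $head = file_get_contents($path, false, null, 0, 1024);
    list($width, $height) = getimagesize($path);
    return crc32($head) . ':' . $width . ':' . $height . ':' . filesize($path);
}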
Image similarity comparison isn't exactly a trivial problem, so unless you really want to devote a lot of effort to image comparison algorithms, your idea of creating some sort of hash of the image data and comparing that will at least allow you to quickly detect exact duplicates. I'd go with your current plan, but make sure it's a decent (but fast) hash so that the likelihood of collisions is low.
The problem with hashes, as suggested, is that if someone changes 1 pixel the hash turns out completely different.
There are excellent frameworks out there that are able to compare the contents of files and return (as a percentage) how much they look alike. There is one in particular, a command-line app I once came across, which was built in a scientific environment and was open source, but I can't remember its name.
This kind of framework could definitely help you out, since they can be extremely fast, even with a large number of files.
Upload the image to IPFS and store the CID; every CID is unique to the file. Store thumbnails locally.
To give a more relevant answer than my first: I suggest the Google Vision API for image recognition (google it, haha), or write a simple script to see what Google Lens says about an item.