Unique image hash that does not change if EXIF info updated - php

I'm looking for a way to create a unique hash for images in python and php.
I thought about using md5 sums for the original file because they can be generated quickly, but when I update EXIF information (sometimes the timezone is off) it changes the sum and the hash changes.
Are there any other ways I can create a hash for these files that will not change when the EXIF info is updated? Efficiency is a concern, as I will be creating hashes for ~500k 30MB images.
Maybe there's a way to create an md5 hash of the image, excluding the EXIF part (I believe it's written at the beginning of the file?) Thanks in advance. Example code is appreciated.

ImageMagick already provides a method to get the image signature. According to the PHP documentation:
"Generates an SHA-256 message digest for the image pixel stream."
So my understanding is that the signature is computed from the pixel data only and isn't affected by changes in the EXIF information.
Also, I've checked that the PythonMagick.Image.signature method is available in the Python bindings, so you should be able to use it in both languages.

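In Python, you can use Pillow's Image.tobytes() (named Image.tostring() in old PIL releases, and no longer available in current Pillow) to compute the MD5 hash of the decoded pixel data only, without the metadata.
from PIL import Image  # Pillow; the bare "import Image" form only worked with old PIL
import hashlib
# filename is the path to the image file
img = Image.open(filename).convert('RGBA')  # normalize the mode so the byte layout is consistent
m = hashlib.md5()
m.update(img.tobytes())  # raw decoded pixels, no EXIF
print(m.hexdigest())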
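Decoding ~500k 30MB images just to hash the pixels may be slow, so the question's idea of hashing the file while skipping the EXIF part is worth sketching too. The following is a minimal sketch, not a complete JPEG parser: it assumes a well-formed baseline JPEG, walks the segments before the scan data, and leaves out every APPn segment (EXIF, JFIF, XMP) and comment segment. The function name is made up for illustration.

```python
import hashlib
import struct


def jpeg_hash_without_metadata(data: bytes) -> str:
    """MD5 of a JPEG byte string, skipping APPn (EXIF, JFIF, XMP) and COM segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    digest = hashlib.md5()
    pos = 2  # first segment starts right after the 2-byte SOI marker
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: the compressed image data runs to the end of the file.
            digest.update(data[pos:])
            return digest.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            # Quantization tables, Huffman tables, frame header, etc.
            digest.update(data[pos:end])
        pos = end
    raise ValueError("no start-of-scan marker found")
```

Usage would be along the lines of jpeg_hash_without_metadata(open(path, 'rb').read()). Because it never decodes the image, the result only stays stable as long as the tool that edits the EXIF data leaves the compressed image data untouched.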

Related

Load binary JPEG image data into PHP variable for MD5 hashing

I am writing a PHP script to verify that the JPEG data in two files are identical. The EXIF/IPTC (metadata) may change between the two files.
My general approach is to use an MD5 hash to compare the binary JPEG data of the two files to confirm it's unchanged.
However, no matter what I do using GD, I seem to be getting an MD5 hash of BOTH the metadata and JPEG data. Does anyone know the best method to extract just the image data from a JPEG file using PHP?
Thanks in advance...
@jarek.d suggested using mogrify (part of ImageMagick), so I am using exec to strip the metadata from both files before comparing them. This works well.

Store image location in database or generate them on the fly based on id

I've been refactoring some code and throwing away some old spaghetti. I am now faced with the following issue:
I have tv episodes which have a screenshot source file and 4 thumbnails. The current code generates the paths during the creation of the thumbnails and also when they are loaded. So the actual path to the image is never stored anywhere. It is generated based on the database id of the episode (using md5 hashes).
This quickly became a mess. Now I decided I store the path to the src and all 4 sizes in a simple json array and plug it into the database.
The question is whether this has any significant downsides? The entire json string is always between 500 and 550 chars.
Or should I stick to the on the fly generation of the paths and figure out a more maintainable way of doing so?
I think either way is valid, but I find the md5 approach easier to handle: you don't have to deal with JSON deserialization and variable extraction, you simply create the hash and build the file path from it.
The trade-off is between computing several md5 hashes on each request and storing and decoding a JSON string per episode.
Just choose the one you like more.
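If you stay with on-the-fly generation, keeping the path logic in a single function goes a long way toward making it maintainable. A minimal sketch follows; the directory layout, the size names, and the function name are all assumptions for illustration, since the question does not describe the real scheme.

```python
import hashlib

# Hypothetical size names; the question only mentions a source file and 4 thumbnails.
SIZES = ("src", "small", "medium", "large", "xlarge")


def episode_image_path(episode_id: int, size: str, root: str = "media") -> str:
    """Derive a deterministic image path from the episode's database id."""
    if size not in SIZES:
        raise ValueError("unknown size: %r" % size)
    digest = hashlib.md5(str(episode_id).encode("ascii")).hexdigest()
    # Shard by the first two hex characters to keep each directory small.
    return "%s/%s/%s_%s.jpg" % (root, digest[:2], digest, size)
```

Both the thumbnail generator and the page that loads the images would call the same function, so the scheme lives in exactly one place.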

How would you implement private images?

I'm developing an Android app which has avatars, much like WhatsApp does. As you know, in WhatsApp you can create a group and set a group picture for it.
I don't have any problems with taking the image, saving it, etc. The problem is that I'm developing the web service in Symfony2 (PHP) and I want to receive the image and save it somewhere on the server. However, those images are NOT public and should be viewable only by users with permission. I've thought about the traditional method of saving the image in a folder and handing out the link only to permitted users, but that is far too easy to get around.
So guys, how would you do this? Maybe saving the binary data into MySql directly? Is there any clean way to achieve this?
Any tips are appreciated.
Thanks.
Another answer is to set the mime type of the PHP call to be an image. A call to a URL like http://xxx/images.php?id=8989031289130 would then return an image instead of an HTML file.
You then have access to the PHP security context and can validate whether the user actually has permissions to view this file.
There are some more details at:
Setting Mime type in PHP
The typical answer here is to use a file naming scheme that precludes guessing. For example, you could take the filename plus a secret salt, hash them together, and append the hash to the filename (before the extension). Thus, what would be /foo/bar/baz.jpg would become /foo/bar/baz_8843d7f92416211de9ebb963ff4ce28125932878.jpg.
So long as your hash salt remains secret, filenames are more or less mathematically protected from random or brute-force discovery. This is, for example, the core of how Facebook protects its users' pictures without having to actually require authentication for each image request (which doesn't scale well at all).
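The naming scheme above can be sketched as follows. Note one substitution: instead of hashing the filename and salt concatenated together, this sketch uses HMAC-SHA1, the standard way to combine a secret with a message, which also yields a 40-character tag like the one in the example path. The function name and the secret are placeholders.

```python
import hashlib
import hmac
import os


def unguessable_name(path: str, secret: bytes) -> str:
    """Append an HMAC-SHA1 of the path to the filename, before the extension."""
    stem, ext = os.path.splitext(path)
    tag = hmac.new(secret, path.encode("utf-8"), hashlib.sha1).hexdigest()
    return "%s_%s%s" % (stem, tag, ext)
```

Calling unguessable_name("/foo/bar/baz.jpg", secret) returns a path of the form /foo/bar/baz_<40 hex characters>.jpg. The server can recompute the name from the original path at any time, so nothing extra needs to be stored.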

Library to handle QRCode generation serverside and store within database?

What I want to accomplish:
When a user hits "Generate QRCode" javascript will take the local machine's datetime and create an md5 hash based on the MMDDYYHHMMSS format. I want to take that hash and have the server generate a QRCode based on that hash and store it within the server's media folder. However all the libraries for QRCode take input and generate the QRCode clientside with no image resources, so I have no way to store it.
Does anyone have any answers as to how I should approach such an implementation?
I created a server-side PHP library which will generate a JPG / PNG / GIF of your QR code.
https://github.com/edent/QR-Generator-PHP/
So, take the hash, pass it to your webserver. Generate the QR code, and then save it.

How does comparing images through md5 work?

Does this method compare the pixel values of the images? I'm guessing it won't work because they are different sizes from each other but what if they are identical, but in different formats? For example, I took a screenshot and saved as a .jpg and another and saved as a .gif.
An MD5 hash is computed from the actual binary data, so different formats will have completely different binary data.
So for MD5 hashes to match, the files must be identical. (The rare exceptions are hash collisions.)
This is actually one way forensic investigators in law enforcement find known contraband images.
It is an MD5 Checksum - the same thing you often see when downloading a file, if the MD5 of the downloaded file matches the MD5 given by the provider, then the file transfer was successful. http://en.wikipedia.org/wiki/Checksum If there is even 1 bit of difference between the 2 files then the resulting hash will be completely different.
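The one-bit claim is easy to check. A small sketch with made-up input bytes:

```python
import hashlib

original = b"example file contents"
# Flip the lowest bit of the first byte; every other byte is unchanged.
altered = bytes([original[0] ^ 0x01]) + original[1:]

digest_original = hashlib.md5(original).hexdigest()
digest_altered = hashlib.md5(altered).hexdigest()

print(digest_original)
print(digest_altered)
print(digest_original == digest_altered)
```

The two digests printed share no obvious relationship, even though the inputs differ by a single bit.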
Due to the difference in encoding between a JPG and GIF, the 2 will not have the same MD5 hash.
md5 is a hash algorithm, so it does not compare images but it compares data. The data you put in can be nearly anything, like the contents of a file. It then outputs a hashstring based on the contents, which is the raw data of the file.
So you basically do not compare images when feeding the image into md5 but the raw data of the image. The hash algorithm does not know anything about it but the raw data, so a jpg and an gif (or any other image format) of the same screenshot will never be the same.
Even if you compare the decoded images, they will not produce the same hash, because lossy compression introduces small differences the human eye cannot see (depending on the amount of compression used). This should be different when comparing the decoded data of losslessly encoded images, provided the color depth and palette are preserved, but I have not verified it.
Take a look at the wikipedia article for a more detailed explanation and technical background about hash functions.
A .jpg file starts with the bytes FF D8 (with 'JFIF' or 'Exif' appearing a few bytes later), while a .gif starts with 'GIF87a' or 'GIF89a' when you look at the raw bytes. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ, even if the actual image is the "same picture".
To do a hash-based image comparison, you have to compare two images using the same format. It would be very difficult to produce a .jpg and a .gif of the same image that would compare equal if you converted them both to (say) a .bmp. It'd be the same file format, but the internal constraints of .gif (at most 256 palette colors, lossless LZW compression) versus those of .jpg (24-bit, lossy discrete cosine transform compression) mean it's nigh-on impossible to get the same .bmp from both source images.
If you're comparing hashes then every single byte of the two images will have to match - they can't use different compression formats, or "look the same". They have to be identical.
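A file's format can be told from its leading bytes alone, which also shows why the two files can never hash the same: JPEG files open with the bytes FF D8 FF, GIF files with the ASCII text GIF87a or GIF89a. A minimal sketch (the function name is made up, and only these two formats are handled):

```python
def sniff_image_format(data: bytes) -> str:
    """Guess the image format from the file's leading bytes."""
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return "unknown"
```

Reading the first six bytes of each file is enough to call it, for example sniff_image_format(open(path, 'rb').read(6)).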
md5 is a hash. It is a code that is calculated from a bunch of data - any data really.
md5 is certainly not unique, but the chance that two different images have the exact same code is quite small. Therefore you could compare images by calculating an md5 code from each of them and comparing the codes.
You cannot compare using the MD5 sum, as all the other posters have noted. However, you can compare the images in a different way, and it will tell you their similarity regardless of image type, or even size. You can use libPuzzle
http://libpuzzle.pureftpd.org/project/libpuzzle
This is a great library for image comparison and works very well.
It will still not work. Any image file contains a header portion and the binary image buffer. In the scenario described:
1. The headers will be different between .jpg and .gif, resulting in a different md5 sum.
2. The image buffer itself may be different due to the lossy compression used by, say, the .jpg format.
md5sum is a tool used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change.
Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious meddling. The md5sum program is included in most Unix-like operating systems or compatibility layers such as Cygwin.
Hence it cannot be used to compare images.
Running md5sum on the images will generate an md5 hash based on each file's raw data. The resulting hash strings will not be the same, since the image formats (GIF and JPEG) are not the same.
In addition, the file sizes of the two images will usually differ as well, and files of different sizes cannot contain identical bytes, so their MD5 hashes will not match.
