Load binary JPEG image data into PHP variable for MD5 hashing - php

I am writing a PHP script to verify that the JPEG data in two files are identical. The EXIF/IPTC (metadata) may change between the two files.
My general approach is to use an MD5 hash to compare the binary JPEG data of the two files to confirm it's unchanged.
However, no matter what I do using GD, I seem to be getting an MD5 hash of BOTH the metadata and JPEG data. Does anyone know the best method to extract just the image data from a JPEG file using PHP?
Thanks in advance...

@jarek.d above suggested using mogrify (part of ImageMagick), so I am using exec to strip the metadata before comparing the two files. This works well.
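For anyone who would rather not shell out to mogrify, here is a minimal sketch of the same idea done by hand: walk the JPEG segments, leave out the metadata ones, and hash what remains. It is written in Python for illustration only. The function name `jpeg_data_md5` is hypothetical, and treating every APPn segment and every COM segment as metadata is my assumption (it also drops ICC profiles and the JFIF header). It does not handle fill bytes between segments or data appended after the end of the image.

```python
import hashlib
import struct

# Assumption: APP0..APP15 (0xE0-0xEF) and COM (0xFE) are treated as metadata.
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}


def jpeg_data_md5(data: bytes) -> str:
    """MD5 of a JPEG byte string with APPn/COM segments left out."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    md5 = hashlib.md5()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # SOS: the compressed scan data runs from here to the end.
            md5.update(data[pos:])
            return md5.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker not in METADATA_MARKERS:
            md5.update(data[pos:pos + 2 + length])
        pos += 2 + length
    raise ValueError("no SOS marker found")
```

Two files that differ only in their EXIF/IPTC segments should then give the same digest, e.g. `jpeg_data_md5(open(path_a, "rb").read())` compared against the same call for the second file (`path_a` being a placeholder).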

Related

PHP encrypt / decrypt huge string text for MySQL DB

I have multiple very large text files, and storing them in MySQL takes a lot of space: 100 texts come to over 1MB (this is just an example). I was wondering whether it is possible to encrypt them to make the text shorter, so that it uses less MySQL DB space, and then decrypt them when I read them back from MySQL so I can see the real text.
I tried base64 and gzip compression, but all of them make the size much bigger than the original.
How can I compress the text files (encrypt / decrypt)?
You can use InnoDB (engine) compression. As you've asked, it is the same as ZIP compression.
The answer is no :) You can't shrink text files using encryption, but you can compress text data in the database. InnoDB compression example in MySQL
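To make the difference concrete, here is a small sketch (Python, for illustration) comparing base64 encoding with zlib compression on a repetitive sample text. The sample string is made up, and real texts will compress less dramatically. In PHP the equivalent calls would be gzcompress() and gzuncompress(), with the result stored in a binary (BLOB) column.

```python
import base64
import zlib

# Made-up sample: highly repetitive, so it compresses very well.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

encoded = base64.b64encode(text)     # an encoding: grows by about a third
compressed = zlib.compress(text, 9)  # compression: shrinks redundant text

print("original  :", len(text), "bytes")
print("base64    :", len(encoded), "bytes")
print("compressed:", len(compressed), "bytes")

# Compression is lossless, so the original text comes back unchanged.
restored = zlib.decompress(compressed)
print("round trip ok:", restored == text)
```

If the base64 step is applied on top of the compressed bytes (for example to fit them into a text column), part of the saving is given back, which may be why the size appeared to grow.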
PHP has the ability to manipulate .zip files.
You could save your text into a .zip file, and simply store the filename in the database. This would save a lot of MySQL database space, but you will need some way to generate unique filenames, and somewhere to store those files.
At least they would be zipped, to save as much disk space as possible...
If you want to make the DB smaller, you can save large texts as files (on the local server, a CDN, or remote servers). Keep only the filenames and additional information about the texts in the DB.
As a result, you will be able to use the database in your application and read the files from disk.

PHP Imagick reinterpretation of PNG IDAT chunks

I noticed that PHP Imagick changes the IDAT chunks when processing PNGs.
How exactly is this done? Is there a possibility to create IDAT chunks that remain unchanged? Is it possible to predict the outcome of Imagick?
Background information for this question:
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs:
$image = new Imagick('uploaded_file.png');
$image->stripImage();
$image->writeImage('secure_file.png');
Comments are stripped out, so the only way to bypass this filter is hiding the PHP payload in the IDAT chunk(s). As described here, it is theoretically possible but Imagick somehow reinterprets this Image data even if I set Compression and CompressionQuality to the values I used to create the PNG. I also managed to create a PNG whose ZLIB header remained unchanged by Imagick, but the raw compressed image data didn't. The only PNGs where I got identical input and output are the ones which went through Imagick before. I also tried to find the reason for this in the source code, but couldn't locate it.
I'm aware of the fact that other checks are necessary to ensure the uploaded file is actually a PNG etc. and PHP code in PNGs is no problem if the server is configured properly, but for now I'm just interested in this issue.
IDAT chunks can vary and still produce an identical image. The PNG spec unfortunately forces the IDAT chunks to form a single continuous data stream. What this means is that the data can be grouped/chunked differently, but when re-assembled into a single stream will be identical. Is the actual data different, or is just the "chunking" changed? If the latter, why does it matter if the image is identical? PNG is a lossless type of compression, so stripping the metadata and even decompressing+recompressing an image shouldn't change any pixel values.
If you're comparing the compressed data and expecting it to be identical, it can be different and still yield an identical image. This is because FLATE compression uses an iterative process to find the best matches in previous data. The higher the "quality" number you give it, the more it will search for matches and shrink the output data size. With zlib, a level 9 deflate request will take a lot longer than the default and result in slightly smaller output data size.
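A quick way to see this for yourself, sketched in Python with the standard zlib module (the sample data is made up and merely stands in for pixel rows): compress the same bytes at two levels and compare both the compressed streams and the decompressed results.

```python
import zlib

# Made-up stand-in for raw image rows.
data = b"".join(b"row %d of pixel data\n" % (i % 7) for i in range(2000))

fast = zlib.compress(data, 1)  # least effort spent searching for matches
best = zlib.compress(data, 9)  # most effort spent searching for matches

print("sizes:", len(data), len(fast), len(best))
print("compressed streams identical:", fast == best)
print("decompressed data identical:",
      zlib.decompress(fast) == zlib.decompress(best))
```

Whatever the compressed bytes look like, both streams decompress to the same data, which is why comparing IDAT contents says nothing about whether the pixels changed.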
So, please answer the following questions:
1) Are you trying to compare the compressed data before/after your strip operation to see if somehow the image changed? If so, then looking at the compressed data is not the way to do it.
2) If you want to strip metadata without any other aspect of the image file changing then you'll need to write the tool yourself. It's actually trivial to walk through PNG chunks and reassemble a new file while skipping the chunks you want to remove.
Answer my questions and I'll update my answer with more details...
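As a rough illustration of point 2, here is a minimal sketch of such a chunk walker in Python. The names `iter_chunks`, `strip_png_metadata` and `make_chunk` are hypothetical, and the set of chunk types to drop is my assumption about what counts as metadata. Every kept chunk, including IDAT, is copied byte for byte, so the compressed image data is never touched.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Assumption: these ancillary chunk types are the "metadata" to remove.
DROP = {b"tEXt", b"zTXt", b"iTXt", b"eXIf", b"tIME"}


def iter_chunks(data: bytes):
    """Yield (type, raw_chunk_bytes) for each chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 12 <= len(data):
        # Chunk layout: 4-byte length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 12 + length
        yield ctype, data[pos:end]
        pos = end
        if ctype == b"IEND":
            break


def strip_png_metadata(data: bytes) -> bytes:
    """Copy every chunk verbatim except the types listed in DROP."""
    kept = [raw for ctype, raw in iter_chunks(data) if ctype not in DROP]
    return PNG_SIGNATURE + b"".join(kept)


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Build one chunk (used here only to fabricate demo input)."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# Synthetic demo input: structurally a PNG, but not a decodable image.
demo = (PNG_SIGNATURE
        + make_chunk(b"IHDR", b"\x00" * 13)
        + make_chunk(b"tEXt", b"Comment\x00hello")
        + make_chunk(b"IDAT", b"compressed-bytes")
        + make_chunk(b"IEND", b""))
print([ctype for ctype, _ in iter_chunks(strip_png_metadata(demo))])
```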
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs
You should never need to think about this. If you are worried about people hiding webshells in a file that is uploaded to your server, you are doing something wrong.
For example, serving those files through the PHP parser....which is the way a webshell could be invoked to attack a server.
From the Imagick readme file:
5) NEVER directly serve any files that have been uploaded by users directly through PHP, instead either serve them through the webserver, without invoking PHP, or use readfile to serve them within PHP.
readfile doesn't execute the file, it just sends it to the end-user without invoking it, and so completely prevents the type of attack you seem to be concerned about.

Unique image hash that does not change if EXIF info updated

I'm looking for a way to create a unique hash for images in python and php.
I thought about using md5 sums for the original file because they can be generated quickly, but when I update EXIF information (sometimes the timezone is off) it changes the sum and the hash changes.
Are there any other ways I can create a hash for these files that will not change when the EXIF info is updated? Efficiency is a concern, as I will be creating hashes for ~500k 30MB images.
Maybe there's a way to create an md5 hash of the image, excluding the EXIF part (I believe it's written at the beginning of the file?) Thanks in advance. Example code is appreciated.
Imagemagick already provides a method to get the image signature. According to the PHP documentation:
Generates an SHA-256 message digest for the image pixel stream.
So my understanding is that the signature isn't affected by changes in the exif information.
Also, I've checked that the PythonMagick.Image.signature method is available in the python bindings, so you should be able to use it in both languages.
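In Python, you could use Image.tobytes() (called tostring() in old PIL releases; that name has since been removed from Pillow) to compute the md5 hash for the image data only, without the metadata.
import hashlib
from PIL import Image  # Pillow; a bare "import Image" only works with the old PIL
# filename is the path to the image you want to hash
# Decode the file and normalise to RGBA so only pixel values are hashed
img = Image.open(filename).convert('RGBA')
m = hashlib.md5()
m.update(img.tobytes())
print(m.hexdigest())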

get size of file in php

I want to save mail attachments, along with their size, in a database.
So I open the mail in text mode with PHP; for attachments I can see something like this example:
Content-Type: image/jpeg; name="donoghte D2.jpg"
Content-Disposition: attachment; filename="donoghte D2.jpg"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_gvn2345e0
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcG
BwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwM ...
I display it with this code:
<?php
header('Content-Type: image/jpeg');
echo (base64_decode($text));
?>
If I want to calculate the size of this file, and store both the file and its size in a database, what is the best way?
Should I save the base64-encoded form (as it was sent in the mail) in the database?
If so, what should the datatype of that field be?
To calculate its size, should I decode it and then take the strlen of the result, or is there a faster way?
Many thanks for your attention.
You're dealing with a binary object there, so it's probably best to store it as the same. I'm not sure what database you're using, but MySQL has the BLOB (Binary Large OBject) for this exact purpose.
You could also write it to the file system. There's dozens of good discussions about the merits of both techniques on Stack Overflow, so I won't go into it here (eg: Storing images in DB? Yea or Nay?)
I believe that if you have the decoded data in a string, then strlen would give you the file size of it. You could also query the database or filesystem after storage to get it.
To get a string length in PHP you can use strlen().
Format: I would use a BLOB, which lets you store the binary data in base64-decoded form. But if you don't care about storage capacity and want to resend the attachment, you may store the base64-encoded data (to save processing time) in any text format the DB supports. If you care about DB storage capacity, I would save the file separately and store only the file name and path in the DB.
Get file size: To get the image length, use strlen on the decoded data. It would also be better to use the Content-Length header.
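On the "is there a faster way" part: the decoded size can be worked out from the length of the base64 text alone, because every 4 base64 characters carry 3 bytes and each trailing "=" stands for one byte that is not there. A minimal sketch in Python (the helper name `decoded_size` is hypothetical, and it assumes well-formed, padded base64 such as a mail attachment body):

```python
import base64


def decoded_size(b64_text: str) -> int:
    """Size in bytes of the decoded data, computed without decoding it."""
    # Mail bodies wrap base64 across lines, so drop all whitespace first.
    compact = "".join(b64_text.split())
    padding = len(compact) - len(compact.rstrip("="))
    return len(compact) * 3 // 4 - padding


# Check against a real decode using made-up sample bytes.
sample = base64.encodebytes(b"\xff\xd8\xff\xe0 some attachment bytes").decode("ascii")
print(decoded_size(sample), len(base64.b64decode(sample)))
```

The same arithmetic carries over to PHP with strlen() on the whitespace-stripped text.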
According to http://us3.php.net/manual/en/function.mb-strlen.php#47309, the following code should give you the string length in bytes, because each byte of a multibyte character is counted as a separate character.
mb_strlen($utf8_string, 'latin1');
However, I would suggest saving the file on disk, as this usually performs a lot better; the pros and cons are listed in nickf's post: https://stackoverflow.com/a/8339065/863577

How does comparing images through md5 work?

Does this method compare the pixel values of the images? I'm guessing it won't work if they are different sizes from each other, but what if they are identical, but in different formats? For example, I took a screenshot and saved it as a .jpg, then took another and saved it as a .gif.
An MD5 hash is of the actual binary data, so different formats will have completely different binary data.
So for MD5 hashes to match, they must be identical files. (There are exceptions in fringe cases.)
This is actually one way forensic law enforcement finds data it deems contraband (in reference to images).
It is an MD5 Checksum - the same thing you often see when downloading a file, if the MD5 of the downloaded file matches the MD5 given by the provider, then the file transfer was successful. http://en.wikipedia.org/wiki/Checksum If there is even 1 bit of difference between the 2 files then the resulting hash will be completely different.
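To illustrate the "even 1 bit" point, here is a small sketch in Python using hashlib. The helper `md5_of_stream` is a hypothetical name, and the data is made up. It reads in chunks so that the same function could be pointed at a large file opened in binary mode.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """MD5 hex digest of a binary stream, read in chunks to limit memory use."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Made-up data: two buffers that differ in exactly one bit.
original = bytes(range(256)) * 64
flipped = bytearray(original)
flipped[100] ^= 0x01

print(md5_of_stream(io.BytesIO(original)))
print(md5_of_stream(io.BytesIO(bytes(flipped))))
```

The two digests printed have nothing visibly in common, even though the inputs differ by a single bit.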
Due to the difference in encoding between a JPG and GIF, the 2 will not have the same MD5 hash.
md5 is a hash algorithm, so it does not compare images, it compares data. The data you put in can be nearly anything, like the contents of a file. It then outputs a hash string based on the contents, which is the raw data of the file.
So when you feed an image into md5 you are basically not comparing images but the raw data of the image. The hash algorithm knows nothing about the image except its raw data, so a jpg and a gif (or any other image format) of the same screenshot will never produce the same hash.
Even if you compare the decoded images, they will not produce the same hash, because they will have small differences that the human eye cannot see (depending on the amount of compression used). This might be different when comparing the decoded data of losslessly encoded images, but I'm not sure about that.
Take a look at the wikipedia article for a more detailed explanation and technical background about hash functions.
A .jpg file starts with the bytes FF D8 (with the string 'JFIF' a few bytes later in a JFIF file), while a .gif starts with 'GIF' when you look at the raw bytes. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ, even if the actual image is the "same picture".
To do a hash-based image comparison, you have to compare two images using the same format. It would be very, very difficult to produce a .jpg and a .gif of the same image that would compare equal if you converted them to (say) a .bmp. It'd be the same file format, but the internal requirements of .gif (8-bit palette, LZW lossless compression) vs. the internal requirements of .jpg (24-bit, lossy discrete cosine transform compression) mean it's nigh-on impossible to get the same .bmp from both source images.
If you're comparing hashes then every single byte of the two images will have to match - they can't use different compression formats, or "look the same". They have to be identical.
md5 is a hash. It is a code that is calculated from a bunch of data - any data really.
An md5 code is certainly not unique, but the chance that two different images have the exact same code is quite small. Therefore you could compare images by calculating an md5 code from each of them and comparing the codes.
You cannot compare the images using the MD5 sum, as all the other posters have noted. However, you can compare the images in a different way, which will tell you their similarity regardless of image type, or even size. You can use libPuzzle:
http://libpuzzle.pureftpd.org/project/libpuzzle
This is a great library for image comparison and works very well.
It will still not work. Any image contains a header portion and the binary image buffer. In the scenario described:
1. The headers will be different between .jpg and .gif, resulting in a different md5 sum.
2. The image buffer itself may be different due to the image compression used by, say, the .jpg format.
md5sum is a tool used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change.
Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious meddling. The md5sum program is included in most Unix-like operating systems or compatibility layers such as Cygwin.
Hence it cannot be used to compare images.
Running md5sum on images will generate an md5 hash based on the images' raw data. The hash strings for these images will not be the same, since the image formats are not the same (i.e. GIF and JPEG).
In addition, if you compare the sizes of these images, they will not be the same either. GIF images can often be bigger than JPEG files, which means the MD5 hash strings will not tally at all.
