I have seen the questions and answers about using MD5 and SHA1 hashes, along with file sizes, to compare files. I set up such a system, and while it works most of the time, there is one specific case where it produces false positives: files that match on MD5 hash, SHA1 hash, and file size.
Specifically, in my case it happens when users upload pictures directly from iPads and iPhones, even though the images are actually totally different.
I am wondering if anyone knows of a different method to add to the check that is more reliable and unique.
Thank you.
Edit: I am also using the file size in bytes.
Use a combination of MD5 and file size; this should be very accurate.
Are you sure that there is no error when creating the hash? It is very unlikely to get a lot of false positives.
Related
I'm writing a file upload site, and am interested in saving space. If a user uploads a file, I want to ensure this file has not already been uploaded before (if it has been, I will just point to the existing file in the database).
I was considering using sha1_file() on the file, checking the database to see if the digest exists in a database of digests. Then I remembered the pigeonhole principle, and decided to check the undigested files against each other if there is a sha1 digest match.
This seems inefficient to me. I figure I could just check the first kilobyte of each file against the other in the event of a checksum match.
I haven't thought too much about the trade-off between processing and storage, and it might be possible that the processing power required to check the files costs more than the storage space I would save.
Are there any shortcomings to this method? Am I wasting my time in even bothering with this?
You could use md5( file_data ) to generate the names of the files; that makes it impossible to upload the same file under a different name. The only problem is that it is technically possible for two different files to generate the same MD5 hash, but that is unlikely, especially if the two files have the same extension, so you can treat this as a non-problem.

Under this scheme there is no reason to even check: if two hashes are the same, the new file simply overwrites the stored one. This is how most file storage engines work internally, such as zimg.

If you are paranoid about collisions, you could first check whether a file with the computed hash and extension already exists, and if it does, compare that stored file's data against the data of the file you are attempting to store. If the data is not equal, you could have it email you an alert.
// Name the stored copy after the MD5 hash of its contents, so identical
// data always maps to the same filename.
$data = file_get_contents('flowers.jpg');
$name = md5($data).'.jpg';
$fh = fopen($name, 'w+');
fwrite($fh, $data);
fclose($fh);
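And if you want the paranoid collision check described above, a minimal sketch might look like this (the alert address is just a placeholder):

// Before overwriting, check whether a file with this hash already exists
// and whether its contents actually differ from the new upload.
$data = file_get_contents('flowers.jpg');
$name = md5($data).'.jpg';

if (file_exists($name) && file_get_contents($name) !== $data) {
    // A genuine MD5 collision between two different files: send an alert.
    mail('admin@example.com', 'MD5 collision detected', $name); // placeholder address
} else {
    file_put_contents($name, $data);
}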
I have over 1.3 million images that I have to compare with each other, and a few hundred more are added per day.
My company takes an image and creates a version that can be utilized by our vendors.
The files are often very similar to each other; for example, two different companies can send us two different images, a JPG and a GIF, both with the McDonald's logo, with months between the submissions.
What happens is that in the end we find ourselves creating the same logo twice, when we could simply copy/paste the already created one, or at least suggest it as a possible starting point for the artists.
I have looked around for algorithms to create a fingerprint, or something that would allow me to do a simple query when a new image is uploaded. Time is relatively not an issue: if it takes 1 second to create each fingerprint, it will take about 15 days to fingerprint the existing images, but it would be such a great saving that we might even get 3 or 4 servers to do it.
I am fluent in PHP, but if the algorithm is in pseudocode or even C I can read it and try to translate it (unless it uses some C-specific libraries).
Currently I am doing an MD5 of all the images to catch the ones that are exactly the same. This question came up when I was thinking of resizing the images and running MD5 on the resized versions, to catch the ones that have been saved in a different format or resized; but even then I would not have good enough recognition.
In case I didn't mention it: I would be happy with something that just suggests possible "similar" images.
EDIT
Keep in mind that the check needs to be done multiple times per minute, so the best solution is one that gives me some value(s) per image that I can store and use in the future to compare against the image I am looking at, without having to re-scan the whole server.
I am reading some pages that mention histograms, or resizing the image to a very small size, stripping possible tags, converting it to grayscale, hashing that file, and using the hash for comparison. If I am successful I will post the code/answer here.
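For reference, here is a rough sketch of that resize-and-grayscale idea as an "average hash" in PHP with GD; the 8x8 size and the function name are my own choices, not something from a tested solution:

// Shrink the image, drop the color, and derive one bit per pixel:
// 1 if the pixel is brighter than the mean, 0 otherwise.
function average_hash($path)
{
    $src = imagecreatefromstring(file_get_contents($path));
    imagepalettetotruecolor($src); // GIF sources come in as palette images
    $img = imagescale($src, 8, 8);
    imagefilter($img, IMG_FILTER_GRAYSCALE);

    $pixels = [];
    for ($y = 0; $y < 8; $y++) {
        for ($x = 0; $x < 8; $x++) {
            $pixels[] = imagecolorat($img, $x, $y) & 0xFF; // gray value 0..255
        }
    }
    $mean = array_sum($pixels) / count($pixels);

    $bits = '';
    foreach ($pixels as $p) {
        $bits .= ($p > $mean) ? '1' : '0';
    }
    return $bits; // 64-character string, cheap to store and index
}

// Count differing bit positions (Hamming distance): small = similar images.
$a = average_hash('logo1.jpg');
$b = average_hash('logo2.gif');
$distance = 0;
for ($i = 0; $i < 64; $i++) {
    if ($a[$i] !== $b[$i]) $distance++;
}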
Try using file_get_contents and:
http://www.php.net/manual/en/function.hash-file.php
If the hashes match, then you know the files are exactly the same.
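For example (note that hash_file() reads the file itself, so file_get_contents() isn't strictly needed; the variable names here are placeholders):

// Any algorithm supported by hash() works; 'md5' is used here for
// consistency with the rest of this thread.
if (hash_file('md5', $newUpload) === hash_file('md5', $existingFile)) {
    // byte-for-byte identical files
}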
EDIT:
If possible, I would think storing the image hashes and the image paths in a database table might help you limit server load. It is much easier to run the hash algorithm once on your initial images and store the hashes in a table. Then, when new images are submitted, you can hash the image and do a lookup on the database table; if the hash is already there, discard it. You can use the hash as the table index, so once you find a match you don't need to check the rest.
The other option is not to use a database, but then you would always have to do an O(n) lookup: hash the incoming image and then run an in-memory search against all n saved images.
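A sketch of that lookup, assuming a PDO connection in $pdo and an images table with hash and path columns (both assumptions on my part):

$hash = hash_file('md5', $uploadedPath);

$stmt = $pdo->prepare('SELECT path FROM images WHERE hash = ?');
$stmt->execute([$hash]);

if ($existingPath = $stmt->fetchColumn()) {
    // Duplicate: point to the stored image instead of saving a new copy.
} else {
    $pdo->prepare('INSERT INTO images (hash, path) VALUES (?, ?)')
        ->execute([$hash, $storedPath]);
}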
EDIT #2:
Please view the solution here: Image comparison - fast algorithm
To speed up the process, sort all the files by size and compare contents only if two sizes are equal. To compare the contents, hash comparison is also the fastest way. Hope this helps.
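A small sketch of that size-first strategy (the directory path is a placeholder): group files by byte size, and only hash within groups that contain more than one file.

$bySize = [];
foreach (glob('/path/to/images/*') as $file) {
    $bySize[filesize($file)][] = $file;
}

foreach ($bySize as $size => $files) {
    if (count($files) < 2) {
        continue; // unique size, so no duplicate is possible
    }
    $byHash = [];
    foreach ($files as $file) {
        $byHash[md5_file($file)][] = $file; // same size AND same hash = duplicate
    }
    foreach ($byHash as $dupes) {
        if (count($dupes) > 1) {
            print_r($dupes); // these files are byte-for-byte identical
        }
    }
}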
I am using a form to upload files (images) to my server. How can I prevent the same image from being uploaded twice? I cannot simply check whether an image with the same title exists, since the same image can have different titles and different images can have the same title.
Any help is appreciated.
Create a hash like ZombieHunter suggested. Why? Because it is easy and fast to search a big table of hashes and check whether the image already exists. Unfortunately, all these hash methods like md5() or md5_file() work on existing files, not on remote ones, so you will have to upload the file anyway; what you can do is then decide whether to keep the file or not. If you are fetching the files from an online resource, there may be ways to detect the file size from the headers and run a hash without downloading it, but that is a special case.
Also, if you have other business logic attached to those images, with concepts like userHasImages or companyHasImages, you can organize them in namespaces/folders/tags so you can speed up the search even further.
Strictly in terms of preventing duplicate entries at the database level, use a unique index on the column that contains the hash.
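For example, assuming an images table and PDO configured to throw exceptions (both my assumptions):

// Run once: a unique index makes duplicate hashes impossible to insert.
$pdo->exec('CREATE UNIQUE INDEX idx_images_hash ON images (hash)');

// On upload: the INSERT fails for an already-seen hash.
try {
    $pdo->prepare('INSERT INTO images (hash, title) VALUES (?, ?)')
        ->execute([md5_file($uploadedPath), $title]);
} catch (PDOException $e) {
    // Duplicate hash: the image already exists, so reject the upload.
}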
Does this method compare the pixel values of the images? I'm guessing it won't work because they are different sizes from each other, but what if they are identical but in different formats? For example, I took a screenshot and saved it as a .jpg, then took another and saved it as a .gif.
An MD5 hash is of the actual binary data, so different formats will have completely different binary data.
So for MD5 hashes to match, the files must be identical. (There are exceptions in fringe cases.)
This is actually one way law enforcement forensics finds data it deems contraband (in reference to images).
It is an MD5 checksum, the same thing you often see when downloading a file: if the MD5 of the downloaded file matches the MD5 given by the provider, the file transfer was successful. See http://en.wikipedia.org/wiki/Checksum. If there is even one bit of difference between the two files, the resulting hashes will be completely different.
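A quick way to see that effect for yourself ('hello' and 'helln' differ in exactly one bit):

echo md5('hello'), "\n"; // 5d41402abc4b2a76b9719d911017c592
echo md5('helln'), "\n"; // one input bit flipped, a completely different digest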
Due to the difference in encoding between a JPG and GIF, the 2 will not have the same MD5 hash.
MD5 is a hash algorithm, so it does not compare images; it compares data. The data you put in can be nearly anything, such as the contents of a file. It then outputs a hash string based on that input, which in this case is the raw data of the file.
So when feeding an image into md5 you are basically not comparing images but the raw data of the image. The hash algorithm knows nothing but the raw bytes, so a JPG and a GIF (or any other image format) of the same screenshot will never hash the same.
Even if you compare the decoded images, they will not produce the same hash: there will be small differences the human eye cannot see (depending on the amount of compression used). This might be different when comparing the decoded data of losslessly encoded images, but I don't know for sure.
Take a look at the wikipedia article for a more detailed explanation and technical background about hash functions.
When you look at the raw bytes, a .jpg file typically carries 'JFIF' near the start of its header, while a .gif starts with 'GIF'. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ, even if the actual image is the "same picture".
To do a hash-based image comparison, you have to compare two images in the same format. Even then, it would be very, very difficult to get a .jpg and a .gif of the same image to compare equal by converting both to (say) .bmp. They would share a file format, but the internal constraints of .gif (8-bit, RLE/LZW lossless compression) versus those of .jpg (24-bit, lossy discrete cosine transform compression) make it nigh-on impossible to get the same .bmp from both source images.
If you're comparing hashes then every single byte of the two images will have to match - they can't use different compression formats, or "look the same". They have to be identical.
md5 is a hash. It is a code that is calculated from a bunch of data - any data really.
MD5 is certainly not unique, but the chance that two different images have the exact same code is quite small. Therefore you can compare images by calculating an MD5 code for each of them and comparing the codes.
You cannot compare using the MD5 sum, as all the other posters have noted. However, you can compare the images in a different way that will tell you their similarity regardless of image type, or even size. You can use libPuzzle:
http://libpuzzle.pureftpd.org/project/libpuzzle
This is a great library for image comparison and works very well.
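Assuming the libpuzzle PHP extension is installed, a comparison looks roughly like this; unlike MD5, it scores visual content, so a JPG and a GIF of the same picture come out as similar:

$vec1 = puzzle_fill_cvec_from_file('screenshot.jpg');
$vec2 = puzzle_fill_cvec_from_file('screenshot.gif');

$distance = puzzle_vector_normalized_distance($vec1, $vec2);

if ($distance < PUZZLE_CVEC_SIMILARITY_THRESHOLD) {
    // visually similar, even though the bytes (and MD5 hashes) differ
}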
It will still not work. Any image file contains a header portion and the binary image buffer. In the scenario described:
1. The headers will differ between .jpg and .gif, resulting in different MD5 sums.
2. The image buffers themselves may differ due to the compression used by, say, the .jpg format.
md5sum is a tool used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change.
Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious meddling. The md5sum program is included in most Unix-like operating systems or compatibility layers such as Cygwin.
Hence it cannot be used to compare images.
Running md5sum on the images will generate MD5 hashes based on the images' raw data. The hash strings for these images will not be the same, since the image formats are not the same (GIF versus JPEG).
In addition, if you compare the sizes of these images, they will not be the same either. GIF images are often bigger than JPEG files, which means the MD5 hash strings will not tally at all.
Users are uploading photos to our PHP-based system. Some of them we mark as forbidden because of irrelevant content. I'm searching for a way to optimize an 'auto-compare' algorithm that skips these photos marked as forbidden. Every upload needs to be compared against many forbidden ones.
Possible solutions:
1/ Store forbidden files and compare whole content - works well but is slow.
2/ Store image file checksum and compare the checksums - this is the idea to improve the speed.
3/ Any intelligent algorithm that is fast enough and can compare similarity between photos. But I don't have any ideas about these in PHP.
What is the best solution?
Don't calculate checksums, calculate hashes!
I once created a simple application that had to look for duplicate images on my hard disk. It would only search for .JPG files, but for every file I would calculate a hash over the first 1024 bytes, then append the width, height, and size of the image to get a string like "875234:640:480:13286", which I would use as the key for the image.
As it turns out, I haven't seen any false duplicates with this algorithm, although there still is a chance of false duplicates.
However, this scheme will let duplicates through when someone just adds one byte to a file, or makes very small adjustments to the image.
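Ported from that description to PHP (the original was Delphi; crc32 stands in for whatever hash you prefer), the key could be built like this:

function quick_key($path)
{
    $head = file_get_contents($path, false, null, 0, 1024); // first 1 KB only
    list($width, $height) = getimagesize($path);
    return crc32($head) . ':' . $width . ':' . $height . ':' . filesize($path);
}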
Another trick could be reducing the size and the number of colors of every image. If you resize every image to 128x128 pixels and reduce the number of colors to 16 (4 bits), you end up with reasonably unique patterns of 8192 bytes each. Calculate a hash value over this pattern and use the hash as the primary key. Once you get a hit, you might still have a false positive, so you would need to compare the pattern of the new image with the pattern stored in your system.
This pattern comparison could be used when the first hash solution indicates that the new image is unique. It's something that I still need to work out for my own tool, though. But it's basically a kind of taking fingerprints of images and then comparing them.
My first solution will find exact matches. My second solution would find similar images. (Btw, I wrote my hash method in Delphi but technically, any hash method would be good enough.)
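One possible reading of that second scheme in PHP with GD (here one byte per pixel rather than the packed 4 bits, so the pattern is 16384 bytes instead of 8192):

function image_pattern($path)
{
    $src = imagecreatefromstring(file_get_contents($path));
    imagepalettetotruecolor($src); // normalize palette sources first
    $img = imagescale($src, 128, 128);
    imagetruecolortopalette($img, false, 16); // quantize to 16 colors

    $pattern = '';
    for ($y = 0; $y < 128; $y++) {
        for ($x = 0; $x < 128; $x++) {
            $pattern .= chr(imagecolorat($img, $x, $y)); // palette index 0..15
        }
    }
    return $pattern; // key it by md5($pattern); on a hit, compare patterns
}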
Image similarity comparison isn't exactly a trivial problem, so unless you really want to devote a lot of effort to image comparison algorithms, your idea of creating some sort of hash of the image data and comparing that will at least allow you to quickly detect exact duplicates. I'd go with your current plan, but make sure it's a decent (but fast) hash so that the likelihood of collisions is low.
The problem with hashes, as suggested, is that if someone changes 1 pixel the hash turns out completely different.
There are excellent frameworks out there that can compare the contents of two files and return (as a percentage) how much they look alike. There is one in particular, a command-line app I once came across, which was built within a scientific environment and was open source, but I can't remember its name.
This kind of framework could definitely help you out, since they can be extremely fast, even with a large number of files.
Upload the image to IPFS and store the CID; every CID is unique to the file. Store thumbnails locally.
To give a more relevant answer than my first: I suggest the Google Vision API for image recognition (google it, haha), or write a simple script to see what Google Lens says about an item.