Users are uploading photos to our PHP-based system. Some of them we mark as forbidden because of irrelevant content. I'm looking to optimize an 'AUTO-COMPARE' algorithm that skips these photos marked as forbidden. Every upload needs to be compared against many forbidden images.
Possible solutions:
1/ Store the forbidden files and compare the whole content - works well but is slow.
2/ Store a checksum for each image file and compare the checksums - this is my idea for improving the speed.
3/ Any intelligent algorithm that is fast enough and can compare similarity between photos. But I don't have any ideas about these in PHP.
What is the best solution?
Don't calculate checksums, calculate hashes!
I once created a simple application that had to look for duplicate images on my hard disk. It would only search for .JPG files, but for every file I would calculate a hash over the first 1024 bytes, then append the width, height and file size to get a string like "875234:640:480:13286", which I would use as the key for the image.
As it turns out, I haven't seen any false duplicates with this algorithm, although there is still a chance of them.
However, this scheme will fail to detect duplicates when someone adds just one byte to the file, or makes very small adjustments to the image.
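For illustration, a quick PHP port of that key (the original was written in Delphi, so treat this as a sketch, not the author's actual code):

<?php
// "hash of first 1KB" + width + height + file size, e.g. "875234:640:480:13286"
function quickImageKey($path) {
    $head = file_get_contents($path, false, null, 0, 1024); // first 1024 bytes
    [$width, $height] = getimagesize($path);
    return hash('crc32b', $head) . ":$width:$height:" . filesize($path);
}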
Another trick could be to reduce the size and the number of colors of every image. If you resize every image to 128x128 pixels and reduce the number of colors to 16 (4 bits), then you end up with reasonably unique patterns of 8192 bytes each. Calculate a hash value over this pattern and use the hash as the primary key. Once you get a hit, you might still have a false positive, so you would need to compare the pattern of the new image with the pattern stored in your system.
This pattern comparison could also be used when the first hash solution indicates that the new image is unique, in order to still find similar (rather than identical) images. It's something that I still need to work out for my own tool, though, but it's basically a way of taking fingerprints of images and then comparing them.
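A hedged GD sketch of that fingerprint (my PHP interpretation; note that the original packs two 4-bit pixels per byte to get 8192 bytes, while GD's palette indices below occupy one byte per pixel):

<?php
// Sketch only: 128x128 thumbnail, 16-colour palette, pattern + hash
function imageFingerprint($path) {
    $src = imagecreatefromstring(file_get_contents($path));
    $thumb = imagecreatetruecolor(128, 128);
    imagecopyresampled($thumb, $src, 0, 0, 0, 0, 128, 128, imagesx($src), imagesy($src));
    imagetruecolortopalette($thumb, false, 16); // reduce to 16 colours (4 bits)
    $pattern = '';
    for ($y = 0; $y < 128; $y++) {
        for ($x = 0; $x < 128; $x++) {
            $pattern .= chr(imagecolorat($thumb, $x, $y)); // palette index 0..15
        }
    }
    // hash is the primary key; keep $pattern for the false-positive check
    return ['hash' => md5($pattern), 'pattern' => $pattern];
}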
My first solution will find exact matches; my second solution will find similar images. (By the way, I wrote my hash method in Delphi, but technically any hash method would be good enough.)
Image similarity comparison isn't exactly a trivial problem, so unless you really want to devote a lot of effort to image comparison algorithms, your idea of creating some sort of hash of the image data and comparing that will at least allow you to quickly detect exact duplicates. I'd go with your current plan, but make sure it's a decent (but fast) hash so that the likelihood of collisions is low.
The problem with hashes, as suggested, is that if someone changes 1 pixel the hash turns out completely different.
There are excellent frameworks out there that are able to compare the contents of files and return (as a percentage) how much they look alike. There is one in particular, a command-line app I once came across, which was built in a scientific environment and was open source, but I can't remember its name.
This kind of framework could definitely help you out, since they can be extremely fast, even with a large number of files.
Upload the image to IPFS and store the CID; every CID is unique to the file. Store thumbnails locally.
To give a more relevant answer than my first: I suggest the Google Vision API for image recognition (google it, haha), or write a simple script to see what Google Lens says about an item.
Related
I have seen the questions and answers about using MD5 and SHA1 hashes as well as file sizes to compare files. I set up such a system, and while it works most of the time, there is one specific case where it produces false positives with identical MD5 hashes, SHA1 hashes and file sizes.
Specifically, in my case it happens when users upload pictures directly from iPads and iPhones, even though the images are actually totally different.
I am wondering if anyone knows of a different method to add to the check that is more reliable and unique.
Thank you.
edit: I am also using the file size in bytes
Use a combination of MD5 and file size; this should be very accurate.
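A minimal illustration of that composite key (the variable names are just examples):

<?php
// combine content hash and byte size into one lookup key
$key = hash_file('md5', $path) . ':' . filesize($path);
// store $key per image; two uploads with the same $key are duplicates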
Are you sure that there is no error when creating the hash? It is very unlikely to get a lot of false positives.
I have a problem to solve but couldn't find a solution.
I need to compare an original image with a photo of the same image, and the function should return true if the photos are equal and false if they are not.
The photo can also have a different size than the original image, and if the photo contains only a part of the original, it should still detect the original.
Can I use normal face detection libraries, or do you have a better solution to this problem?
Thanks
There are several ways you could approach this problem. If you are looking to see whether images are EXACTLY the same, you can go after the file: an MD5 comparison will tell you whether it's the exact same file. But this won't work for actually comparing the image contents.
If you want to actually compare the contents of the pictures, I suggest taking a look at PHP's GD library.
After some googling around I found a nice blog entry here about comparing the similarity of images. It's a good read.
A good method to start off with when comparing photos with GD is making the images the same size. The size should be reasonably small, so I'd say somewhere around 16x16. You can then compare RGB values, shapes, etc.
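A rough GD sketch of that idea; the 16x16 size is from above, but the distance measure and any threshold you pick are judgment calls:

<?php
// resize both images to 16x16 and average the per-channel differences
function resizeTo16($path) {
    $src = imagecreatefromstring(file_get_contents($path));
    $dst = imagecreatetruecolor(16, 16);
    imagecopyresampled($dst, $src, 0, 0, 0, 0, 16, 16, imagesx($src), imagesy($src));
    return $dst;
}

function imageDistance($pathA, $pathB) {
    $a = resizeTo16($pathA);
    $b = resizeTo16($pathB);
    $sum = 0;
    for ($y = 0; $y < 16; $y++) {
        for ($x = 0; $x < 16; $x++) {
            $pa = imagecolorat($a, $x, $y);
            $pb = imagecolorat($b, $x, $y);
            $sum += abs((($pa >> 16) & 0xFF) - (($pb >> 16) & 0xFF))  // R
                  + abs((($pa >> 8) & 0xFF) - (($pb >> 8) & 0xFF))    // G
                  + abs(($pa & 0xFF) - ($pb & 0xFF));                 // B
        }
    }
    return $sum / (16 * 16 * 3); // average difference per channel, 0..255
}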
Some other libraries I should point you to are libpuzzle and ImageMagick, both of which make it pretty easy to compare images in PHP. The documentation is pretty bad, though, so it may require a lot more googling and actual testing. Good luck!
I have over 1.3 million images that I have to compare with each other, and a few hundred are added per day.
My company takes an image and creates a version that can be utilized by our vendors.
The files are often very similar to each other; for example, two different companies can send us two different images, a JPG and a GIF, both with the McDonald's logo, with months between the submissions.
What happens is that in the end we create the same logo twice, when we could simply copy/paste the one already created, or at least suggest it as a possible starting point for the artists.
I have looked around for algorithms to create a fingerprint or something that will allow me to do a simple query when a new image is uploaded. Time is relatively not an issue: if it takes 1 second to create each fingerprint, it will take about 15 days to fingerprint the backlog, but the savings would be so great that we might even get 3 or 4 servers to do it.
I am fluent in PHP, but if the algorithm is in pseudocode or even C, I can read it and try to translate it (unless it uses some C-specific libraries).
Currently I am running MD5 on all the images to catch the ones that are exactly the same. This question came up when I was thinking of resizing each image and running MD5 on the resized version, to catch the ones that have been saved in a different format or resized, but then I would still not have good enough recognition.
In case I didn't mention it: I would be happy with something that just suggests possible "similar" images.
EDIT
Keep in mind that the check needs to be done multiple times per minute, so the best solution is one that gives me some values per image that I can store and use in the future to compare against the image I am looking at, without having to re-scan the whole server.
I am reading some pages that mention histograms, or resizing the image to a very small size, stripping possible tags, converting it to grayscale, hashing that file and using the hash for comparison. If I am successful I will post the code/answer here.
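For what it's worth, here is a hedged GD sketch of that plan (an "average hash": tiny grayscale thumbnail, one bit per pixel against the mean). This is my interpretation, not the asker's final code:

<?php
function averageHash($path) {
    $src = imagecreatefromstring(file_get_contents($path));
    $img = imagecreatetruecolor(8, 8);
    imagecopyresampled($img, $src, 0, 0, 0, 0, 8, 8, imagesx($src), imagesy($src));
    imagefilter($img, IMG_FILTER_GRAYSCALE);
    $pix = [];
    for ($y = 0; $y < 8; $y++) {
        for ($x = 0; $x < 8; $x++) {
            $pix[] = imagecolorat($img, $x, $y) & 0xFF; // any channel after grayscale
        }
    }
    $mean = array_sum($pix) / 64;
    $bits = '';
    foreach ($pix as $p) {
        $bits .= ($p >= $mean) ? '1' : '0'; // 1 bit per pixel
    }
    return $bits; // 64-character string to store per image
}

Store this value once per image; similar images can then be ranked by how many characters differ between two stored strings (Hamming distance), without re-scanning the server.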
Try using file_get_contents() with hash(), or simply hash_file():
http://www.php.net/manual/en/function.hash-file.php
If the hashes match, then you know they are the exact same.
EDIT:
If possible, I would think storing the image hashes and the image paths in a database table might help you limit server load. It is much easier to run the hash algorithm once on your initial images and store the hashes in a table. Then, when new images are submitted, you can hash the image and do a lookup in the database table; if the hash is already there, discard the upload. You can use the hash as the table index, so once you find a match you don't need to check the rest.
The other option is not to use a database, but then you would always have to do an O(n) lookup: hash the incoming image, then run an in-memory search against all n saved images.
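A minimal sketch of the database variant, assuming a PDO connection and an images(hash, path) table (both names are mine):

<?php
$hash = hash_file('md5', $_FILES['upload']['tmp_name']);
$stmt = $pdo->prepare('SELECT path FROM images WHERE hash = ? LIMIT 1');
$stmt->execute([$hash]);
if ($existing = $stmt->fetchColumn()) {
    // duplicate: an identical file is already stored at $existing
} else {
    $pdo->prepare('INSERT INTO images (hash, path) VALUES (?, ?)')
        ->execute([$hash, $targetPath]); // $targetPath: wherever you saved the upload
}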
EDIT #2:
Please view the solution here: Image comparison - fast algorithm
To speed up the process, sort all the files by size and compare the contents only if two sizes are equal. For comparing the contents, hash comparison is also the fastest way. Hope this helps.
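A small sketch of that filter, assuming $files is your list of paths:

<?php
$bySize = [];
foreach ($files as $f) {
    $bySize[filesize($f)][] = $f; // group by byte size first
}
foreach ($bySize as $group) {
    if (count($group) < 2) continue; // unique size, nothing to compare
    $seen = [];
    foreach ($group as $f) {
        $h = hash_file('md5', $f); // hash only within same-size groups
        if (isset($seen[$h])) {
            echo "$f duplicates {$seen[$h]}\n";
        } else {
            $seen[$h] = $f;
        }
    }
}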
My users are uploading images to my website, and I would like to first offer them images that have already been uploaded. My idea is to
1. create some kind of image "hash" of every existing image
2. create a hash of the newly uploaded image and compare it with the others in the database
I have found some interesting solutions, like http://www.pureftpd.org/project/libpuzzle or http://phash.org/ etc., but they each have one or more problems:
they need some nonstandard extension to PHP (or are not in PHP at all) - that would be OK for me, but I would like to create this as a plugin for my popular CMS, which is used in many hosting environments outside my control.
they compare two images, but I need to compare one to many (e.g. thousands), and doing it one by one would be very ineffective / slow ...
...
I would be OK with finding only VERY similar images (so e.g. a different size, a re-saved JPG, or a different JPG compression factor).
The only idea I have is to resize the image to e.g. 5px*5px and 256 colors, create a string representation of it, and then look for identical strings. But I guess that even two copies of the same image at different sizes may end up with tiny differences in color, so finding only 100% identical strings would be useless.
So I would need some good format for that string representation of the image which could then be used with some SQL function to find similar ones, or some other nice way. E.g. pHash creates perceptual hashes, so when two numbers are close, the images should be close as well, and I would just need to find the closest distances. But again, it is an external library.
Is there any easy way?
I've had this exact same issue before.
Feel free to copy what I did, and hopefully it will help you / solve your problem.
How I solved it
My first idea, which failed and is similar to what you may be thinking, was to make a string for every single image (no matter what size). But I quickly worked out that this fills your database super fast and wasn't effective.
The next option (which works) was a smaller image (like your 5px idea), and I did exactly that, but with 10px*10px images. The way I created the 'hash' for each image was with the imagecolorat() function.
See php.net here.
When receiving the RGB colours for the image, I rounded them to the nearest 50 so that the colours were less specific. That number (50) is what you want to change depending on how specific you want your searches to be.
for example:
// Pixel RGB
rgb(105, 126, 225) // Original
rgb(100, 150, 250) // After rounding numbers to nearest 50
After doing this to every pixel (10px*10px will give you 100 rgb() values back), I then turned them into an array and stored them in the database using base64_encode() and serialize().
When searching for similar images, I ran the exact same process on the image being uploaded, then extracted the image 'hashes' from the database and compared them all to see which had matching rounded RGBs.
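A hedged reconstruction of the steps above (the function name and structure are mine, not the answerer's):

<?php
function roughColourHash($path, $step = 50) {
    $src = imagecreatefromstring(file_get_contents($path));
    $img = imagecreatetruecolor(10, 10);
    imagecopyresampled($img, $src, 0, 0, 0, 0, 10, 10, imagesx($src), imagesy($src));
    $out = [];
    for ($y = 0; $y < 10; $y++) {
        for ($x = 0; $x < 10; $x++) {
            $rgb = imagecolorat($img, $x, $y);
            // round each channel to the nearest $step (50 here), as described above
            $out[] = [
                (int) (round((($rgb >> 16) & 0xFF) / $step) * $step), // R
                (int) (round((($rgb >> 8) & 0xFF) / $step) * $step),  // G
                (int) (round(($rgb & 0xFF) / $step) * $step),         // B
            ];
        }
    }
    return base64_encode(serialize($out)); // stored per image, as described
}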
Tips
The bigger that 50 in the RGB rounding is, the less specific your search will be (and vice versa).
If you want your SQL to be more specific, it may be better to store extra/specific info about the image in the database, so that you can limit the searches you run. E.g. if the aspect ratio is 4:3, only pull images around 4:3 from the database.
It can be difficult to resize to exactly 5px*5px, so a suggestion is phpthumb. I used it with the syntax:
phpthumb.php?src=IMAGE_NAME_HERE.png&w=10&h=10&zc=1
// &w= width of your image
// &h= height of your image
// &zc= zoom control. 0:Keep aspect ratio, 1:Change to suit your width+height
Good luck mate, hope I could help.
For an easy php implementation check out: https://github.com/kennethrapp/phasher
However, I wonder if there is a native MySQL function for "compare" (see the PHP class above).
I scale the image down to 8x8, then convert the RGB values to 1-byte HSV, so the resulting hash is a 172-byte string.
HSVHSVHSVHSVHSVHSVHSVHSV... (from 8x8 block, 172 bytes long)
0fff0f3ffff4373f346fff00...
It's not 100% accurate (some duplicates aren't found), but it works nicely, and there appear to be no false positives.
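For illustration, a GD sketch along those lines; the exact quantization and 172-byte packing are the answerer's, so this variant simply emits one hex digit per channel and its output length differs:

<?php
function hsvHash($path) {
    $src = imagecreatefromstring(file_get_contents($path));
    $img = imagecreatetruecolor(8, 8);
    imagecopyresampled($img, $src, 0, 0, 0, 0, 8, 8, imagesx($src), imagesy($src));
    $hash = '';
    for ($y = 0; $y < 8; $y++) {
        for ($x = 0; $x < 8; $x++) {
            $p = imagecolorat($img, $x, $y);
            $r = (($p >> 16) & 0xFF) / 255;
            $g = (($p >> 8) & 0xFF) / 255;
            $b = ($p & 0xFF) / 255;
            $max = max($r, $g, $b);
            $min = min($r, $g, $b);
            $d = $max - $min;
            // hue in [0,1)
            if ($d == 0) {
                $h = 0;
            } elseif ($max == $r) {
                $h = fmod(($g - $b) / $d, 6) / 6;
                if ($h < 0) $h += 1;
            } elseif ($max == $g) {
                $h = ((($b - $r) / $d) + 2) / 6;
            } else {
                $h = ((($r - $g) / $d) + 4) / 6;
            }
            $s = $max == 0 ? 0 : $d / $max;
            $v = $max;
            // quantize each channel to one hex digit (0-15)
            $hash .= dechex((int) min(15, $h * 16))
                   . dechex((int) min(15, $s * 16))
                   . dechex((int) min(15, $v * 16));
        }
    }
    return $hash;
}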
Putting it in an academic way, what you are looking for is a similarity function that takes two images and returns an indicator of how far apart/similar they are. This indicator could easily be a decimal number ranging from -1 to 1 (far apart to very close). Once you have this function you can set an image as a reference and compare all the images against it. Then finding images similar to a given one is as simple as finding the closest similarity factor, which can be done with a simple search over a DOUBLE column in an RDBMS like MySQL.
Now all that remains is to define the similarity function. To be honest, this is problem-specific: it depends on what you call similar. But covariance is usually a good starting point; it just needs your two images to be of the same size, which I think is no big deal. You can find lots of other ideas by searching for 'similarity measures between two images'.
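As an illustration, a minimal PHP sketch of such a function, assuming both images have already been resized to the same dimensions (e.g. with GD). It returns the Pearson correlation of the luminance values, which is covariance normalized into the -1..1 range mentioned above:

<?php
function pixelVector($img) {
    $v = [];
    for ($y = 0, $h = imagesy($img); $y < $h; $y++) {
        for ($x = 0, $w = imagesx($img); $x < $w; $x++) {
            $p = imagecolorat($img, $x, $y);
            // luminance approximation from R, G, B
            $v[] = 0.299 * (($p >> 16) & 0xFF)
                 + 0.587 * (($p >> 8) & 0xFF)
                 + 0.114 * ($p & 0xFF);
        }
    }
    return $v;
}

function similarity(array $a, array $b) {
    $n = count($a);
    $ma = array_sum($a) / $n;
    $mb = array_sum($b) / $n;
    $cov = $va = $vb = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $cov += ($a[$i] - $ma) * ($b[$i] - $mb);
        $va  += ($a[$i] - $ma) ** 2;
        $vb  += ($b[$i] - $mb) ** 2;
    }
    $norm = sqrt($va * $vb);
    return $norm > 0 ? $cov / $norm : 0.0; // normalized covariance, -1..1
}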
I have many files of the same picture in various resolutions, suitable for different devices like mobile, PC, PSP, etc. Now I am trying to display only unique pictures on the page, but I don't know how. I could have avoided this if I had maintained a database in the first place, but I didn't, and I need your help detecting the largest unique pictures.
Install GD2 and libpuzzle on your server.
libpuzzle is astonishing and easy to play with. Check this snippet:
<?php
# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');
# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);
# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
echo "Pictures are looking similar\n";
} else {
echo "Pictures are different, distance=$d\n";
}
# Compress the signatures for database storage
$compress_cvec1 = puzzle_compress_cvec($cvec1);
$compress_cvec2 = puzzle_compress_cvec($cvec2);
Well, even though there are quite a few algorithms to do this, I believe it would still be faster to do it manually. Download all the images and feed them into something like Windows Live Photo Gallery or any other software that can match similar images.
This will take you a few hours, but implementing an image-matching algorithm could take far more. After that, you could spend the extra time amending your current system to store everything in a DB.
Fix the cause of the problem, not its symptoms.
Firstly, your problem has hardly anything to do with PHP, so I have removed that tag and added more relevant tags.
Doing this smartly will not require NxN comparisons. You can use lots of heuristics, but first I would like to ask you:
Are all the copies of one image exact resizes of each other (or is there some cropping involved - matching cropped images to the original could be more difficult and time-consuming)?
Were all the images generated (resized) using the same tool?
What parameters did you use to resize? For example, are all pictures for display on the PSP in the same resolution?
What is your estimate of how many unique images you have (i.e., how many copies of each picture there might be, on average)?
Do you have any kind of categorization already done? For example, are all mobile images in a separate folder (or of a different resolution than the PC images)? This alone could reduce the number of comparisons a lot, even if you otherwise do brute force.
A very top-level hint on why you don't need NxN comparisons: you can devise many different approximate hashes (for example, the distribution of high/low-frequency JPEG coefficients) and group "potentially" similar images together. This can reduce the number of comparisons required by a factor of 10-100 or even more, depending on the quality of the heuristic and the data set. The hashing can even be done on parts of images. 30000 is not a very large number if you use the right techniques.
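A toy illustration of that grouping (not the JPEG-coefficient heuristic itself; the bucketing here uses a very coarse 4x4 grayscale key, and $paths is assumed to be your file list):

<?php
function coarseKey($path) {
    $src = imagecreatefromstring(file_get_contents($path));
    $img = imagecreatetruecolor(4, 4);
    imagecopyresampled($img, $src, 0, 0, 0, 0, 4, 4, imagesx($src), imagesy($src));
    imagefilter($img, IMG_FILTER_GRAYSCALE);
    $key = '';
    for ($y = 0; $y < 4; $y++) {
        for ($x = 0; $x < 4; $x++) {
            // keep only the top 3 bits of each pixel, so near-identical
            // images tend to land in the same bucket
            $key .= chr((imagecolorat($img, $x, $y) & 0xFF) >> 5);
        }
    }
    return bin2hex($key);
}

$buckets = [];
foreach ($paths as $p) {
    $buckets[coarseKey($p)][] = $p;
}
// full pairwise comparison now only runs within each (much smaller) bucket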
You should check which of the two images is the smaller, take its dimensions, and then compare only the pixels within that rectangle.