I have over 1.3 million images that I have to compare with each other, and a few hundred more are added every day.
My company takes an image and creates a version that can be used by our vendors.
The files are often very similar to each other. For example, two different companies can send us two different images, a JPG and a GIF, both with the McDonald's logo, with months between the submissions.
What happens is that in the end we find ourselves creating the same logo twice, when we could simply copy/paste the already created one, or at least suggest it as a possible starting point for the artists.
I have looked around for algorithms to create a fingerprint, or something that will let me run a simple query when a new image is uploaded. Time is relatively not an issue: at 1 second per fingerprint, the backlog would take about 15 days to process, but the savings would be so great that we might even dedicate 3 or 4 servers to it.
I am fluent in PHP, but if the algorithm is in pseudocode or even C, I can read it and try to translate it (unless it uses some C-specific libraries).
Currently I am doing an MD5 of all the images to catch the ones that are exactly the same. This question came up while I was considering resizing each image and running the MD5 on the resized version, to catch the ones that have been saved in a different format or resized, but even then the recognition would not be good enough.
In case I didn't mention it: I would be happy with something that just suggests possible "similar" images.
EDIT
Keep in mind that the check needs to be done multiple times per minute, so the best solution is one that gives me some values per image that I can store and use in the future to compare against the image I am looking at, without having to re-scan the whole server.
I am reading some pages that mention histograms, or resizing the image to a very small size, stripping any tags, converting it to grayscale, hashing that file, and using the hash for comparison. If I am successful I will post the code/answer here.
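As a starting point, here is a minimal sketch of that approach with GD; the 16px size and the function name are arbitrary, illustrative choices:
<?php
// Minimal sketch of the fingerprint idea above: normalize every image to a
// tiny grayscale thumbnail, then hash the raw pixel values. Requires GD.
function make_fingerprint(string $path, int $size = 16): string
{
    $src   = imagecreatefromstring(file_get_contents($path)); // format-agnostic load
    $thumb = imagescale($src, $size, $size);                  // normalize dimensions
    imagefilter($thumb, IMG_FILTER_GRAYSCALE);                // discard color
    $pixels = '';
    for ($y = 0; $y < $size; $y++) {
        for ($x = 0; $x < $size; $x++) {
            // After the grayscale filter R = G = B, so one channel is enough.
            $pixels .= chr(imagecolorat($thumb, $x, $y) & 0xFF);
        }
    }
    return md5($pixels); // store this next to the image path
}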
Try reading the file with file_get_contents() and hashing it, or simply use hash_file():
http://www.php.net/manual/en/function.hash-file.php
If the hashes match, then you know the files are exactly the same.
EDIT:
If possible, I would suggest storing the image hashes and the image paths in a database table; that might help you limit server load. It is much easier to run the hash algorithm once on your initial images and store the hashes in a table. Then, when new images are submitted, you can hash the image and do a lookup on the database table. If the hash is already there, discard the upload. You can use the hash as the table index, so once you find a match you don't need to check the rest.
The other option is to not use a database, but then you would always have to do an O(n) lookup: hash the incoming image and then run an in-memory search against all n saved images.
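A rough sketch of the database approach, assuming a hypothetical images(hash CHAR(32) PRIMARY KEY, path VARCHAR(255)) table and MD5 as the hash:
<?php
// Sketch of the lookup described above. Table and column names are
// hypothetical; $uploadedPath is the path of the newly submitted image.
$pdo  = new PDO('mysql:host=localhost;dbname=media', 'user', 'pass');
$hash = hash_file('md5', $uploadedPath);

$stmt = $pdo->prepare('SELECT path FROM images WHERE hash = ?');
$stmt->execute([$hash]);

if ($existing = $stmt->fetchColumn()) {
    echo "Exact duplicate of $existing - discarding upload\n";
} else {
    $pdo->prepare('INSERT INTO images (hash, path) VALUES (?, ?)')
        ->execute([$hash, $uploadedPath]);
}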
EDIT #2:
Please view the solution here: Image comparison - fast algorithm
To speed up the process, sort all the files by size and compare contents only when two sizes are equal. For comparing the contents, a hash comparison is also the fastest way. Hope this helps.
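A quick sketch of that size-first shortcut in PHP (the directory path is just an example):
<?php
// Group files by byte size and only hash within groups, since files of
// different sizes can't possibly be identical.
$bySize = [];
foreach (glob('/path/to/images/*') as $file) {   // example path
    $bySize[filesize($file)][] = $file;
}
foreach ($bySize as $files) {
    if (count($files) < 2) {
        continue;                                // unique size, no hashing needed
    }
    $byHash = [];
    foreach ($files as $file) {
        $byHash[hash_file('md5', $file)][] = $file;
    }
    foreach ($byHash as $dupes) {
        if (count($dupes) > 1) {
            echo 'Identical files: ' . implode(', ', $dupes) . "\n";
        }
    }
}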
Hello, I'm trying to develop a new web site for a special purpose. I have a list of images uploaded on the server. I need to upload an image from the PC, search the list of images on the server, and return the images with the best similarity to the uploaded one, based on image colour rather than faces, all of this using PHP.
This link describes my problem, but has no code. Thanks!
This is a very complex task you are trying to do (and especially hard, because you want to do it in PHP).
What I can think of (in general) to achieve this contains the following sub-tasks:
Recognize colours
Recognize shapes
Recognize the connections between the two above
In PHP the last two are nearly impossible (and make little sense, as PHP is not an image processing library; it only has basic functions). But you can do the first one using this library:
https://github.com/thephpleague/color-extractor
You can make the comparison as fine-grained as you want. Get the most used colours (e.g. the top 1,000) and compare them as arrays. Obviously you won't get an exact match, but if you compare the first 1,000 and find 500 matches, then that picture is somewhat similar to the other. However, you could get completely false results, so this is more a programmatic solution than a logical one.
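A sketch of what that comparison could look like with the library's documented Palette and ColorExtractor classes; the 1,000-colour count and the 50% overlap threshold are arbitrary example values, and the file names are hypothetical:
<?php
require 'vendor/autoload.php';

use League\ColorExtractor\ColorExtractor;
use League\ColorExtractor\Palette;

// Return the $count most used colours of an image as integers.
function top_colors(string $path, int $count = 1000): array
{
    $palette = Palette::fromFilename($path);
    return (new ColorExtractor($palette))->extract($count); // most used first
}

$a = top_colors('submitted_logo.jpg');   // hypothetical file names
$b = top_colors('existing_logo.gif');

$shared = count(array_intersect($a, $b));
if ($shared > count($a) / 2) {
    echo "Palettes share $shared colours - possibly similar images\n";
}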
I've been refactoring some code and throwing away some old spaghetti. I am now faced with the following issue:
I have TV episodes which have a screenshot source file and 4 thumbnails. The current code generates the paths during the creation of the thumbnails and again when they are loaded, so the actual path to the image is never stored anywhere. It is generated based on the database id of the episode (using MD5 hashes).
This quickly became a mess. Now I have decided to store the paths to the source and all 4 sizes in a simple JSON array and plug it into the database.
The question is whether this has any significant downsides? The entire json string is always between 500 and 550 chars.
Or should I stick to the on the fly generation of the paths and figure out a more maintainable way of doing so?
I think either way is valid, but I find it easier to handle MD5, as you don't have to deal with JSON deserialization and variable extraction; you simply create the hash and the file path.
The trade-off may come down to processing several MD5 hashes versus storing and parsing several JSON blobs.
Just choose the one you like more.
My users are uploading images to my website, and I would like to first offer them already uploaded images. My idea is to:
1. create some kind of image "hash" for every existing image
2. create a hash of the newly uploaded image and compare it with the others in the database
I have found some interesting solutions like http://www.pureftpd.org/project/libpuzzle or http://phash.org/ etc., but each has one or more problems:
they need some non-standard PHP extension (or are not in PHP at all) - that would be OK for me, but I would like to create this as a plugin for my popular CMS, which is used on many hosting environments outside my control.
they compare two images, but I need to compare one to many (e.g. thousands), and doing it one by one would be very ineffective/slow ...
...
I would be OK with finding only VERY similar images (so e.g. different size, re-saved JPG, or different JPG compression factor).
The only idea I have come up with is to resize the image to e.g. 5px*5px and 256 colors, create a string representation of it, and then look for the same string. But I guess this may produce tiny differences in colors even between two identical images of different sizes, so finding only 100% matches would be useless.
So I would need some good format for that string representation of the image, one that could be used with some SQL function to find similar images, or some other nice way. E.g. pHash creates perceptual hashes, so when two numbers are close the images should be close as well, and I would just need to find the closest distances (sketched below). But it is again an external library.
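For illustration only: if the perceptual hashes were 64-bit integers stored in an unsigned BIGINT column, MySQL's BIT_COUNT over an XOR gives the Hamming distance, so the closest-distance query could look like this sketch (table and column names are hypothetical):
<?php
// Sketch: BIT_COUNT(a ^ b) counts the differing bits between two hashes,
// i.e. their Hamming distance. $newHash is the hash of the uploaded image.
$pdo  = new PDO('mysql:host=localhost;dbname=media', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT path, BIT_COUNT(phash ^ ?) AS distance
       FROM images
      WHERE BIT_COUNT(phash ^ ?) <= 10   -- tolerance in bits, tune to taste
      ORDER BY distance
      LIMIT 20'
);
$stmt->execute([$newHash, $newHash]);
$similar = $stmt->fetchAll(PDO::FETCH_ASSOC);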
Is there any easy way?
I've had this exact same issue before.
Feel free to copy what I did, and hopefully it will help you / solve your problem.
How I solved it
My first idea, which failed (and is similar to what you may be thinking), was to build a string for every single image (no matter what size). But I quickly worked out that this fills your database super fast and wasn't effective.
The next option (the one that works) was a smaller image (like your 5px idea), and I did exactly that, but with 10px*10px images. The way I created the 'hash' for each image was with the imagecolorat() function.
See the imagecolorat() page on php.net.
When reading the RGB colours of the image, I rounded them to the nearest 50, so that the colours were less specific. That number (50) is what you want to change depending on how specific you want your searches to be.
for example:
// Pixel RGB
rgb(105, 126, 225) // Original
rgb(100, 150, 250) // After rounding numbers to nearest 50
After doing this to every pixel (10px*10px will give you back 100 rgb() values), I then turned them into an array and stored them in the database with base64_encode() and serialize().
When searching for similar images, I ran the exact same process on the image being uploaded, then extracted the image 'hashes' from the database and compared them all to see which had matching rounded RGBs.
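A sketch of that hashing step with GD; imagescale() is used here in place of phpThumb, and the 50-step rounding matches the description above:
<?php
// Sketch: 10x10 thumbnail, each RGB channel rounded to the nearest $step.
function image_signature(string $path, int $step = 50): string
{
    $src   = imagecreatefromstring(file_get_contents($path));
    $thumb = imagescale($src, 10, 10);
    $sig   = [];
    for ($y = 0; $y < 10; $y++) {
        for ($x = 0; $x < 10; $x++) {
            $rgb = imagecolorat($thumb, $x, $y);
            // Round each channel to the nearest $step to blur small differences.
            $sig[] = (int) (round((($rgb >> 16) & 0xFF) / $step) * $step); // red
            $sig[] = (int) (round((($rgb >> 8)  & 0xFF) / $step) * $step); // green
            $sig[] = (int) (round(( $rgb        & 0xFF) / $step) * $step); // blue
        }
    }
    return base64_encode(serialize($sig)); // stored form, as described above
}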
Tips
The bigger that 50 in the RGB rounding is, the less specific your search will be (and vice versa).
If you want your SQL to be more specific, it may be better to store extra/specific info about the image in the database, so that you can limit the searches you run. E.g. if the aspect ratio is 4:3, only pull images around 4:3 from the database.
It can be difficult to get the resize to exactly 10px*10px, so a suggestion is phpThumb. I used it with the syntax:
phpthumb.php?src=IMAGE_NAME_HERE.png&w=10&h=10&zc=1
// &w= width of your image
// &h= height of your image
// &zc= zoom control. 0:Keep aspect ratio, 1:Change to suit your width+height
Good luck mate, hope I could help.
For an easy PHP implementation, check out: https://github.com/kennethrapp/phasher
However, I wonder if there is a native MySQL function for "compare" (see the PHP class above).
I scale the image down to 8x8, then I convert the RGB to 1-byte HSV, so the resulting hash is a 192-byte string (8x8 pixels times 3 channels).
HSVHSVHSVHSVHSVHSVHSVHSV... (from the 8x8 block, 192 bytes long)
0fff0f3ffff4373f346fff00...
It's not 100% accurate (some duplicates aren't found), but it works nicely, and there seem to be no false positives.
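A sketch of how such a hash could be built in PHP with GD; packing each HSV channel into a single hex digit is an assumption based on the sample string above:
<?php
// Sketch: 8x8 thumbnail, each pixel's HSV packed as 3 hex digits,
// giving 8 * 8 * 3 = 192 characters.
function hsv_hash(string $path): string
{
    $src   = imagecreatefromstring(file_get_contents($path));
    $thumb = imagescale($src, 8, 8);
    $hash  = '';
    for ($y = 0; $y < 8; $y++) {
        for ($x = 0; $x < 8; $x++) {
            $rgb = imagecolorat($thumb, $x, $y);
            foreach (rgb_to_hsv(($rgb >> 16) & 0xFF, ($rgb >> 8) & 0xFF, $rgb & 0xFF) as $c) {
                $hash .= dechex($c); // one hex digit (0-15) per channel
            }
        }
    }
    return $hash;
}

// Standard RGB -> HSV conversion, each channel scaled to 0-15.
function rgb_to_hsv(int $r, int $g, int $b): array
{
    $r /= 255; $g /= 255; $b /= 255;
    $max = max($r, $g, $b);
    $min = min($r, $g, $b);
    $d   = $max - $min;
    $v   = $max;
    $s   = $max == 0.0 ? 0.0 : $d / $max;
    if ($d == 0.0) {
        $h = 0.0;
    } elseif ($max == $r) {
        $h = fmod(($g - $b) / $d + 6.0, 6.0) / 6.0; // wrap hue into [0, 1)
    } elseif ($max == $g) {
        $h = (($b - $r) / $d + 2.0) / 6.0;
    } else {
        $h = (($r - $g) / $d + 4.0) / 6.0;
    }
    return [(int) round($h * 15), (int) round($s * 15), (int) round($v * 15)];
}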
To put it academically, what you are looking for is a similarity function which takes in two images and returns an indicator of how far apart/similar the two images are. This indicator could easily be a decimal number ranging from -1 to 1 (far apart to very close). Once you have this function you can set an image as a reference and compare all the images against it. Then finding images similar to a given one is as simple as finding the closest similarity factor to it, which is done with a simple search over a DOUBLE field within an RDBMS like MySQL.
Now all that remains is how to define the similarity function. To be honest, this is problem-specific: it depends on what you call similar. But covariance is usually a good starting point; it just needs your two images to be of the same size, which I think is no big deal. You can find lots of other ideas by searching for 'similarity measures between two images'.
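As a sketch, covariance normalized by both standard deviations (Pearson correlation) yields exactly the -1 to 1 indicator described above; this assumes the two images have already been scaled to the same size and flattened into grayscale pixel arrays:
<?php
// Sketch: Pearson correlation between two equal-length pixel arrays.
function similarity(array $a, array $b): float
{
    $n     = count($a);                  // assumes count($a) === count($b)
    $meanA = array_sum($a) / $n;
    $meanB = array_sum($b) / $n;
    $cov = $varA = $varB = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $da = $a[$i] - $meanA;
        $db = $b[$i] - $meanB;
        $cov  += $da * $db;
        $varA += $da * $da;
        $varB += $db * $db;
    }
    if ($varA == 0.0 || $varB == 0.0) {
        return 0.0; // a flat (single-colour) image has no defined correlation
    }
    return $cov / sqrt($varA * $varB);   // -1 (opposite) .. 1 (identical)
}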
I have many files of the same picture in various resolutions, suitable for devices like mobile, PC, PSP, etc. Now I am trying to display only the unique pictures on the page, but I don't know how. I could have avoided this if I had maintained a database in the first place, but I didn't. I need your help detecting the largest unique pictures.
Install GD2 and libpuzzle on your server.
Libpuzzle is astonishing and easy to play with. Check this snippet:
<?php
# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');
# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);
# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
echo "Pictures are looking similar\n";
} else {
echo "Pictures are different, distance=$d\n";
}
# Compress the signatures for database storage
$compress_cvec1 = puzzle_compress_cvec($cvec1);
$compress_cvec2 = puzzle_compress_cvec($cvec2);
Well, even though there are quite a few algorithms to do that, I believe it would still be faster to do it manually. Download all the images and feed them into something like Windows Live Photo Gallery, or any other software that can match similar images.
This will take you a few hours, but implementing an image-matching algorithm could take far more. After that, you could spend the extra time amending your current system to store everything in a DB.
Fix the cause of the problem, not its symptoms.
Firstly, your problem has hardly anything to do with PHP, so I have removed that tag and added more relevant tags.
Doing this smartly will not require NxN comparisons. You can use lots of heuristics, but first I would like to ask you:
Are all the copies of one image exact resizes of each other (or is there some cropping involved? Matching cropped images to the original could be more difficult and time-consuming)?
Are all images generated (resized) using the same tool?
What parameters did you use to resize? For example, are all pictures meant for the PSP in the same resolution?
What is your estimate of how many unique images you have (i.e., how many copies of each picture there might be, on average)?
Do you have any kind of categorization already done? For example, are all mobile images in a separate folder (or at a different resolution than the PC images)? This alone could reduce the number of comparisons a lot, even if you otherwise brute-force.
A very top-level hint on why you don't need NxN comparisons: you can devise many different approximate hashes (for example, from the distribution of high/low-frequency JPEG coefficients) and group "potentially" similar images together. This can reduce the number of comparisons required by a factor of 10-100 or even more, depending on the quality of the heuristic and the data set. The hashing can even be done on parts of images. 30,000 is not a very large number if you use the right techniques.
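To illustrate the grouping idea, a sketch that buckets images by a coarse aspect ratio before any pairwise comparison; the ratio rounding is an arbitrary example heuristic:
<?php
// Sketch: bucket images by a cheap approximate key first, then run the
// expensive pixel comparisons only inside each bucket.
$buckets = [];
foreach (glob('/path/to/images/*') as $file) {  // example path
    [$w, $h] = getimagesize($file);
    $buckets[(string) round($w / $h, 1)][] = $file;
}
foreach ($buckets as $ratio => $files) {
    // Only files sharing a bucket need pairwise comparison - usually a
    // small fraction of the full NxN work.
    echo "ratio $ratio: " . count($files) . " candidates\n";
}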
You should check which of the two images is the smaller, take its dimensions, and then compare only the pixels within that rectangle.
Users are uploading photos to our PHP-based system. Some of them we mark as forbidden because of irrelevant content. I'm looking for ways to optimize an 'auto-compare' algorithm that skips the photos marked as forbidden. Every upload needs to be compared against many forbidden ones.
Possible solutions:
1/ Store the forbidden files and compare the whole contents - works well, but is slow.
2/ Store image file checksums and compare the checksums - this is the idea for improving the speed.
3/ Any intelligent algorithm that is fast enough and can compare similarity between photos - but I don't have any ideas about these in PHP.
What is the best solution?
Don't calculate checksums, calculate hashes!
I once created a simple application that had to look for duplicate images on my hard disk. It would only search for .JPG files, but for every file I would calculate a hash value over the first 1024 bytes, then append the width, height, and size of the image to get a string like "875234:640:480:13286", which I would use as the key for the image.
As it turns out, I haven't seen any false duplicates with this algorithm, although there is still a chance of them.
However, this scheme will miss duplicates when someone just adds one byte to the file, or makes very small adjustments to the image.
Another trick could be to reduce the size and number of colors of every image. If you resize every image to 128x128 pixels and reduce the number of colors to 16 (4 bits), then you end up with reasonably unique patterns of 8192 bytes each. Calculate a hash value over this pattern and use the hash as the primary key. Once you get a hit, you might still have a false positive, thus you would need to compare the pattern of the new image with the pattern stored in your system.
This pattern comparison could be used if the first hash solution indicates that the new image is unique. It's something that I still need to work out for my own tool, though, but it's basically a way of taking fingerprints of images and then comparing them.
My first solution will find exact matches. My second solution would find similar images. (Btw, I wrote my hash method in Delphi, but technically any hash method would be good enough.)
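A PHP sketch of the first scheme (the original was in Delphi); crc32() stands in here for whatever hash function is used over the first 1024 bytes:
<?php
// Sketch: hash the first 1 KiB, then append width, height and file size
// to form a lookup key like "875234:640:480:13286".
function quick_key(string $path): string
{
    $head    = file_get_contents($path, false, null, 0, 1024); // first 1 KiB
    [$w, $h] = getimagesize($path);
    return crc32($head) . ":$w:$h:" . filesize($path);
}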
Image similarity comparison isn't exactly a trivial problem, so unless you really want to devote a lot of effort to image comparison algorithms, your idea of creating some sort of hash of the image data and comparing that will at least allow you to quickly detect exact duplicates. I'd go with your current plan, but make sure it's a decent (but fast) hash so that the likelihood of collisions is low.
The problem with hashes, as suggested, is that if someone changes one pixel, the hash comes out completely different.
There are excellent frameworks out there that can compare the contents of files and return (as a percentage) how much they look alike. There is one in particular, a command-line app I once came across, which was built in a scientific environment and was open source, but I can't remember its name.
This kind of framework could definitely help you out, since such tools can be extremely fast, even with a large number of files.
Upload the image to IPFS and store the CID; every CID is unique to the file. Store the thumbnails locally.
To give a more relevant answer than my first: I suggest the Google Vision API for image recognition (Google it, haha), or write a simple script to see what Google Lens says about an item.