Image comparison with php + gd

Image comparison with php + gd - php

What's the best approach to comparing two images with php and the Graphic Draw (GD) Library?
This is the scenario:
I have an image, and I want to find which image of a given set is the most similar to it.
The most similar image is in fact the same image, not pixel perfect match but the same image.
I've dramatised the difference between the two images with the number one on the example just to ease the understanding of what I meant.
Even though it brought no consistent results, my approach was to reduce the images to 1px using the imagecopyresampled function and see how close the RGB values where between images.
The sum of the values of deducting each red, green and blue decimal equivalent value from the red, green and blue decimal equivalent value of the possible match gave me a dissimilarity index that, even though it didn't work as expected since not always the most RGB similar image was the target image, I could use to select an image from the available targets.
Here's a sample of the output when comparing 4 images against a target image, in this case the apple logo, that matches one of them but is not exactly the same:
Original image:
Red:222 Green:226 Blue:232
Compared against:
http://a1.twimg.com/profile_images/571171388/logo-twitter_normal.png
Red:183 Green:212 Blue:212 and an index of similarity of 56
Red:117 Green:028 Blue:028 and an index of dissimilarity 530
Red:218 Green:221 Blue:221 and an index of dissimilarity 13 Matched Correctly.
Red:061 Green:063 Blue:063 and an index of dissimilarity 491
May not even be doable better with better results than what I'm already getting and I'm wasting my time here but since there seems to be a lot of experienced php programmers I guess you can point me in the right directions on how to improve this.
I'm open to other image libraries such as iMagick, Gmagick or Cairo for php but I'd prefer to avoid using other languages than php.
Thanks in advance.

I'd have thought your approach seems reasonable, but reducing an entire image to 1x1 pixel in size is probably a step too far.
However, if you converted each image to the same size and then computed the average colour in each 16x16 (or 32x32, 64x64, etc. depending on how much processing time/power you wish to use) cell you should be able to form some kind of sensible(-ish) comparison.

I would suggest, like middaparka, that you do not downsample to a 1 pixel only image, because you loose all the spatial information. Downsampling to 16x16 (or 32x32, etc.) would certainly provide better results.
Then it also depends on whether color information is important or not to you. From what I understand you could actually do without it and compute a gray-level image starting from your color image (e.g. luma) and compute the cross-correlation. If, like you said, there is a couple of images that matches exactly (except for color information) this should give you a pretty good reliability.

I used the ideas of scaling, downsampling and gray-level mentioned in the question and answers, to apply a Mean Squared Error between the pixels channels values for 2 images, using GD Library.
The code is in this answer, including a test with those ideas.
Also I did some benckmarking and I think the downsampling could be not needed in those little images, cause the method is fast (being PHP), just a fraction of a second.

Using middparka's methods, you can transform each image into a sequence of numeric values and then use the Levenshtein algorithm to find the closest match.

Related

What is the effect of 'posterizing' an image?

I am working at the moment on an issue where we are seeing CPU Usage issues on a particular host when converting images using iMagick. The issue is pretty perfectly described here:
https://github.com/ResponsiveImagesCG/wp-tevko-responsive-images/issues/150 (I don't use that particular library, but I DO use the same responsive images classes they do, and I am timing out on that particular line, only for some images).
They seem to suggest that removing the call to ->posterizeImage() will fix their issue, and in my tests it does, I can't even tell any difference in the converted images. But this worries me because I wonder if there is a difference that I am not seeing, or one that only comes up in certain scenarios (I mean if posterizing an image didn't do anything there wouldn't be a method for it, right?). I see online that it 'Reduces the image to a limited number of color level' (136 levels in the case causing an issue for me, for what it's worth). I'm having some difficulty parsing that though, which I think is related to a poor grasp of the way various image formats store data (really it doesn't go past the idea that an image is broken up into pixels, which are broken up into proportions of red green and blue).
What actual visual differences could I expect to see if we stop posterizing images? Is it something that I would only expect in certain types of image (like, would it be more visible in transparent over non-transparent, or warmer coloured images)? Or that would be more evident in certain display styles (like print, or the warmer colour temp in iPhone displays)?
Basically I am looking for the info to make an informed choice on whether it's safe to comment out. I'm not worried if it means some images might be x Kb larger, but if it will make them look poor quality, or distort them in some way (even in corner cases) then I need to consider other options.

From the ImageMagick command line documentation:
-posterize levels
reduce the image to a limited number of color levels per channel.
Very low values of levels, e.g., 2, 3, 4, have the most visible effect.
There is a bit more info in the Color Quantization examples - it also has some example images:
The operators original purpose (using an argument of '2') is to re-color images using just 8 basic colors, as if the image was generated using a simple and cheap poster printing method using just the basic colors. Thus the operator gets its name.
...
An argument of '3' will map image colors based on a colormap of 27 colors, including mid-tone colors. While an argument of '4' will generate a 64 color colortable, and '5' generates a 125 color colormap.
Essentially it reduces the number of colors used in the image - and by extension the size. Using a level of 136 would not have much visible affect, as this translates to a 2,515,456 color colortable (136^3).
It is also worth noting from the commit for the issue you linked is that this isn't even always an effective way of reducing image size:
... it turns out that posterization only improves file sizes
for PNGs and can actually lead to slightly larger file sizes for
JPG images.

Posterisation is a reduction of the amount of colour information stored in an image - as such, it is really a decrease in quality. It's hard to imagine how stopping doing this could be detrimental. And, if it turns out later that there is/was a legitimate reason for doing it, you can always do it later because if you stop doing it now, you will still have all the original information.
If it was the other way around, and you started to introduce posterisation and later found out it was undesirable for some reason, you would no longer be able to get the original information back.
So, I would see no harm in stopping posterising. And the fact that I have written that, kind of challenges anyone who knows better to speak up and tell me I am wrong :-)

Image comparison in php

My scenario is as follows:
I have to save 1000 of images in database, and then I have to compare new image with database images for matches (match should be 70% or more) to get the best match image from database in php.
is there any algorithm or method for fast comparison with better result ...
Thanks in advance :)

I would suggest you use a Perceptual Hash or similar - mainly for reasons of performance. In essence, you create a single number, or hash, for each image ONCE in your database at the point where you insert it, and retain that hash in the database. Then when you get a new image to insert, you calculate its hash and compare it to the PRE-CALCULATED hash of all the other images so that you don't have to drag all the megabytes of pixels of your existing images from disk to compare them.
The best pHASHes are scale-invariant and image format invariant. Here is an article by Dr Neal Krawetz... Perceptual Hashing.
ImageMagick can also do Perceptual Hashing and is callable from PHP - see here.

Try this class. It support get hash string from image to store in database and compare with new image later:
https://github.com/nvthaovn/CompareImage
It is very fast and accurate, although not optimal code. I have 20000 pictures in my database.

This depends entirely on how smart you want the algorithm to be.
For instance, here are some issues:
cropped images vs. an uncropped image
images with a text added vs. another without
mirrored images
The easiest and simplest algorithm I've seen for this is just to do the following steps to each image:
scale to something small, like 64x64 or 32x32, disregard aspect ratio, use a combining scaling algorithm instead of nearest pixel
scale the color ranges so that the darkest is black and lightest is white
rotate and flip the image so that the lighest color is top left, and then top-right is next darker, bottom-left is next darker (as far as possible of course)
Edit A combining scaling algorithm is one that when scaling 10 pixels down to one will do it using a function that takes the color of all those 10 pixels and combines them into one. Can be done with algorithms like averaging, mean-value, or more complex ones like bicubic splines.
Then calculate the mean distance pixel-by-pixel between the two images.
To look up a possible match in a database, store the pixel colors as individual columns in the database, index a bunch of them (but not all, unless you use a very small image), and do a query that uses a range for each pixel value, ie. every image where the pixel in the small image is between -5 and +5 of the image you want to look up.
This is easy to implement, and fairly fast to run, but of course won't handle most advanced differences. For that you need much more advanced algorithms.

Find similar images in (pure) PHP / MySQL

My users are uploading images to my website and i would like first to offer them already uploaded images first. My idea is to
1. create some kind of image "hash" of every existing image
2. create a hash of newly uploaded image and compare it with the other in the database
i have found some interesting solutions like http://www.pureftpd.org/project/libpuzzle or or http://phash.org/ etc. but they got one or more problems
they need some nonstandard extension to PHP (or are not in PHP at all) - it would be OK for me, but I would like to create it as a plugin to my popular CMS, which is used on many hosting environments without my control.
they are comparing two images but i need to compare one to many (e.g. thousands) and doing it one by one would be very uneffective / slow ...
...
I would be OK to find only VERY similar images (so e.g. different size, resaved jpg or different jpg compression factor).
The only idea I got is to resize the image to e.g. 5px*5px* 256 colors, create a string representation of it and then find the same. But I guess that it may have create tiny differences in colors even with just two same images with different size, so finding just the 100 % same would be useless.
So I would need some good format of that string representation of image which than could be used with some SQL function to find similar, or some other nice way. E.g. phash create perceptional hashes, so when two numbers are close, the images should be close as well, so i just need to find closest distances. But it is again external library.
Is there any easy way?

I've had this exact same issue before.
Feel free to copy what I did, and hopefully it will help you / solve your problem.
How I solved it
My first idea that failed, similar to what you may be thinking, is I ended up making strings for every single image (no matter what size). But I quickly worked out this fills your database super fast, and wasn't effective.
Next option (that works) was a smaller image (like your 5px idea), and I did exactly that, but with 10px*10px images. The way I created the 'hash' for each image was the imagecolorat() function.
See php.net here.
When receiving the rgb colours for the image, I rounded them to the nearest 50, so that the colours were less specific. That number (50) is what you want to change depending on how specific you want your searches to be.
for example:
// Pixel RGB
rgb(105, 126, 225) // Original
rgb(100, 150, 250) // After rounding numbers to nearest 50
After doing this to every pixel (10px*10px will give you 100 rgb()'s back), I then turned them into an array, and stored them in the database as base64_encode() and serialize().
When doing the search for images that are similar, I did the exact same process to the image they wanted to upload, and then extracted image 'hashes' from the database to compare them all, and see what had matching rounded rgb's.
Tips
The Bigger that 50 is in the rgb rounding, the less specific your search will be (and vice versa).
If you want your SQL to be more specific, it may be better to store extra/specific info about the image in the database, so that you can limit the searches you get in the database. eg. if the aspect ratio is 4:3, only pull images around 4:3 from the database. (etc)
It can be difficult to get this perfectly 5px*5px, so a suggestion is phpthumb. I used it with the syntax:
phpthumb.php?src=IMAGE_NAME_HERE.png&w=10&h=10&zc=1
// &w= width of your image
// &h= height of your image
// &zc= zoom control. 0:Keep aspect ratio, 1:Change to suit your width+height
Good luck mate, hope I could help.

For an easy php implementation check out: https://github.com/kennethrapp/phasher
However - I wonder if there is a native mySql function for "compare" (see php class above)

I scale down image to 8x8 then I convert RGB to 1-byte HSV so result hash is 172 bytes string.
HSVHSVHSVHSVHSVHSVHSVHSV... (from 8x8 block, 172 bytes long)
0fff0f3ffff4373f346fff00...
It's not 100% accurate (some duplicates aren't found) but it works nice and looks like there is no false positive results.

Putting it down in an academical way, what you are looking for is a similarity function which takes in two images and returns an indicator how far/similar the two images are. This indicator could easily be a decimal number ranging from -1 to 1 (far apart to very close). Once you have this function you can set an image as a reference and compare all the images against it. Then finding the similar images to one is as simple as finding the closest similarity factor to it which is done with a simple search over a double field within an RDBMS like MySQL.
Now all that remains is how to define the similarity function. To be honest this is problem specific. It depends on what you call similar. But covariance is usually a good starting point, it just needs your two images to be of the same size which I think is of no big deal. Yet you can find lots of other ideas searching for 'similarity measures between two images'.

Is there a way to find images of a certain color from specified sites?

First off, I don't mean google image search!
I would like to give users the ability to select a hex color value and then have a search programatically return (from specified sites/directories online) images where the dominant color is the color they specified (or close to it).
Is there a technology that can do this? I'd prefer PHP/MySQL, but I'd be willing to use other languages if it would be simpler.
EDIT
Taking several suggestions, I managed to find this: http://www.coolphptools.com/color_extract which does a decent job at extracting the most common colors from the image.
The next step is calculating distance from the extracted colors to the color being searched for. I have no issue implementing it except I'm unclear on the best way to calculate the color distance?
I've scoured this site and google for a concrete answer, but come up dry. The tool above extracts colors into hex color codes. I am currently converting this to RGB and using those.
Should I attempt to convert RGB to Y'UV? I'm attempting that by using:
sqrt(((r - r1) * .299)^2 + ((g - g1) * .587)^2 + ((b - b1) * .114)^2)
(based on an answer here: RGB to closest predefined color)
It's not very accurate. What should I swap that color distance formula with so it calculates accurate color distance (to the human eye)?

Interesting.
The first problem is: "What is the dominant colour of an image?" Maybe the one most pixels have. What do you do with similar shades of the same colour? Would you cluster around similar colours?
I would implement it this way:
Grab all images inside your search paths. Cluster the colors used in each of them and the biggest cluster is the dominant color. You will have to play around a bit with cluster sizes and number of clusters. If this color is within a certain range of hue, saturation and brightness of your searched color it is a match.

Firstly, I wonder how can you crawl over the sites/directories to search for a particular image color, unless you have a big list of websites. If it isn't related to your question then just ignore it.
Back to your question, I personally think this is an interesting question as well. Since it requires quite a few research, I just want to point out some ideas for you to reference.
What you need to do is to get user-specified hex colors and convert them into RGB colors, because most of the image functions in PHP that I know only work with RGB. Now, if you have a list of directories that you can search for, then just crawl over them and use some basic functions to get hold of the desired webpage' contents (e.g. file_get_contents, or cURL). Once you have the contents of a specific page, you will need to use DOM functions to get images' URLs from that page (you can work it out yourself, using: getElementsByTagName() and getAttribute()). Now assuming that you are holding a list of image URLs, now you need to get their colors and try to match them with your user-specified colors (remember to convert everything into RGB).
In PHP we have a very convenient GD library that works with images. If your server support GD2 then you can have a look at imagecolorclosest(). This function "Returns the index of the color in the palette of the image which is "closest" to the specified RGB value". Note that the function only returns the closest match (not exactly match), so you have to do some comparisons to choose the right images (I believe this is easy because you now have RGB colors with very handy values to work with, say, using some subtraction and adjustment method).
Moreover, not only the images, when you have a specific page content, you can try to search for the color scheme of that page (by getting its "background-color" value), there are quite a few details that you can get and play around with :) Of course, an image's color is somehow related to its page's styling scheme colors, think logically wider.
If I'm saying something not clear, don't hesitate to comment on my reply :)
Happy coding.

What is a good algorithm or library for cropping images to avoid whitespace or empty areas?

I have a whole bunch of images of illustrations that I would like to crop to a smaller preview size.
The problem is that I want to crop them to show an "interesting" part of the illustration (ie avoid areas of whitespace).
The images typically have a flat color or a subtle gradient for the background. They are mostly vector style artwork with fairly distinct shapes.
Here are some examples: link ;-)
I've been thinking about using some sort of image feature detection algorithm with a sliding window to find the area with the greatest number of features.
I'm implementing this in PHP, but I don't mind implementing it myself if there isn't a library or extension available.
Ideas?

ImageMagick has a trim operation. It's available as a library but I don't know how hard it is to use from PHP. There are some PHP interfaces.

OK, so here's what I would've done, after looking at the examples:
Sum all rows and all columns of each image. You'll get two arrays, both looking like this:
/-----\ /--\
_/ -- |
___- \_________
By looking at these arrays for a few images, find a suitable threshold (probably something just above zero). Then the leftmost and the rightmost crossing of this threshold is where you have to crop. I hope I've managed to make it clear enough, if not -- ask!

Here's a fairly simple approach using an edge-detection filter, and then cropping around the center-of-edginess of the image to generate a thumbnail. It works pretty well on most images, but not if there are more than one subject. I'm open to suggestions on other ways of identifying the "interesting" points in a source image.

Well, you might want to consider just using an edge detection algorithm. Pick the area with the largest number of edges. Give higher weight to edges that are not blurry (as they may be from the background).

ImageMagick for PHP has automated generation of thumbnails. This SO question has a link to an ImageMagick auto-crop operator, and I'm not sure, but I think this is the PHP interface to it.
From the link:
bool Imagick::trimImage ( float
$fuzz )
Remove edges that are
the background color from the image.
For more general "interestingness", maybe try an inverse of seam carving (to find the highest energy, rather than lowest energy areas).

A CLI program using http://pecl.php.net/package/imagick:
<?php
dl('imagick.so');
$img = new Imagick();
$img->readImage($argv[1]);
# (* 0.0: exact match; * 1.0: crop entire image)
$fuzz = current($img->getQuantumRange()) * 0.25;
$img->trimImage($fuzz);
$img->writeImage($argv[2]);
?>
It should work good enough, as long as the image doesn't have a frame around its border.

Drupal has a project called smartcrop, which has PHP code to find highest entropy and "interesting" areas in images. See the output examples.
You should be able to use the functions in the module and the library in none-Drupal projects too.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.