Sort very similar images PHP OCR

Thanks for looking at my question.
Basically, what I'm trying to do is find all images that look like the first and third images here: http://imgur.com/a/IhHEC
and remove all the ones that don't (the second and fourth).
I've tried several libraries to no avail.
Another acceptable way to do this is to check if the images contain "Code:", as that string is in each one that I have to sort out.
Thank you,
Steve
EDIT: Although the 1st and 3rd images seem like they are the same size, they are not.

If those are the actual images you're going to use, it looks like histogram similarity will do the job. The first and third have high contrast; the second and fourth, especially the fourth, have a wide range of different intensities.
You could easily make a histogram of the shades of grey in the image and then apply thresholds to the shape of the histogram to classify them.
EDIT: To actually do this: iterate through every pixel and build an array of pixel value => number of times found. As the image is greyscale, you can take any one of the R, G or B channels. Then divide each count by the number of pixels in the image to normalise, so it works for any size. Each entry in the histogram is then the fraction of pixels that have that value. Finally, count the entries above a certain threshold. If there are lots of greys, you'll get a large number of small values. If there aren't, you'll get a small number of large values.
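The steps above could be sketched roughly as follows. This is only a minimal sketch, assuming the GD extension for reading pixels; the function names and the two thresholds (0.05 and 0.8) are made up for illustration and would need tuning against the real images.

```php
<?php
// Read one grey value (0-255) per pixel from a GD image.
// Assumes the image is greyscale, so the red channel alone is enough.
function greyValuesFromImage($image): array
{
    // imagecolorat() returns a palette index for palette images, so convert first.
    if (!imageistruecolor($image)) {
        imagepalettetotruecolor($image);
    }
    $values = [];
    $width = imagesx($image);
    $height = imagesy($image);
    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            $values[] = (imagecolorat($image, $x, $y) >> 16) & 0xFF;
        }
    }
    return $values;
}

// Normalised histogram: grey value => fraction of pixels with that value.
function greyHistogram(array $greyValues): array
{
    $histogram = array_fill(0, 256, 0);
    foreach ($greyValues as $value) {
        $histogram[$value]++;
    }
    $total = count($greyValues);
    return array_map(fn($count) => $total > 0 ? $count / $total : 0.0, $histogram);
}

// True when a few grey levels hold most of the pixels (hypothetical thresholds).
function looksHighContrast(array $histogram, float $minShare = 0.05, float $minCoverage = 0.8): bool
{
    $covered = 0.0;
    foreach ($histogram as $share) {
        if ($share > $minShare) {
            $covered += $share;
        }
    }
    return $covered >= $minCoverage;
}
```

With a hypothetical file name, usage would be something like looksHighContrast(greyHistogram(greyValuesFromImage(imagecreatefrompng('scan.png')))).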

Because my background is more in working with text from images than with image objects, I would do this as a post-OCR process, by searching the text content for keywords or matching a regular expression that represents your desired data. This means the job needs to be split into two stages: image-to-text OCR (free or cheap, software or cloud), and the actual separation process (simple programming).
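The second stage could be as small as the sketch below. It assumes the OCR stage has already produced plain text for each image (which OCR tool does that is left open), and the function name is hypothetical. It looks for the "Code:" label mentioned in the question.

```php
<?php
// Returns true when the OCR text contains the "Code:" label.
// Case-insensitive, and tolerant of stray whitespace before the colon,
// since OCR output is often slightly off.
function containsCodeLabel(string $ocrText): bool
{
    return preg_match('/\bcode\s*:/i', $ocrText) === 1;
}

// Keep only the images whose OCR text contains the label.
// $ocrByFile maps a file name to the text the OCR stage returned for it.
function filesWithCodeLabel(array $ocrByFile): array
{
    return array_keys(array_filter($ocrByFile, 'containsCodeLabel'));
}
```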

Related

Find similar images in (pure) PHP / MySQL

My users are uploading images to my website, and I would like to first offer them images that have already been uploaded. My idea is to
1. create some kind of image "hash" of every existing image
2. create a hash of the newly uploaded image and compare it with the others in the database
I have found some interesting solutions like http://www.pureftpd.org/project/libpuzzle or http://phash.org/ etc., but they have one or more problems:
they need some nonstandard PHP extension (or are not in PHP at all). That would be OK for me, but I would like to create this as a plugin for my popular CMS, which is used on many hosting environments outside my control.
they compare two images, but I need to compare one to many (e.g. thousands), and doing it one by one would be very inefficient / slow ...
...
I would be OK with finding only VERY similar images (e.g. a different size, a resaved JPG or a different JPG compression factor).
The only idea I've got is to resize the image to e.g. 5px*5px with 256 colors, create a string representation of it and then look for an identical one. But I guess that even two copies of the same image at different sizes may end up with tiny differences in color, so searching only for 100% identical strings would be useless.
So I would need some good format for that string representation of the image, which could then be used with some SQL function to find similar ones, or some other nice way. E.g. pHash creates perceptual hashes, so when two numbers are close, the images should be close as well, and I just need to find the closest distances. But it is again an external library.
Is there any easy way?

I've had this exact same issue before.
Feel free to copy what I did, and hopefully it will help you / solve your problem.
How I solved it
My first idea, which failed and is similar to what you may be thinking, was to make strings for every single image (no matter what size). But I quickly worked out that this fills your database super fast and wasn't effective.
The next option (which works) was a smaller image (like your 5px idea), and I did exactly that, but with 10px*10px images. I created the 'hash' for each image with the imagecolorat() function.
See the imagecolorat() documentation on php.net.
When reading the RGB colours for the image, I rounded them to the nearest 50, so that the colours were less specific. That number (50) is what you want to change depending on how specific you want your searches to be.
for example:
// Pixel RGB
rgb(105, 126, 225) // Original
rgb(100, 150, 250) // After rounding numbers to nearest 50
After doing this to every pixel (10px*10px will give you 100 rgb()'s back), I turned them into an array and stored it in the database after running it through serialize() and base64_encode().
When doing the search for images that are similar, I did the exact same process to the image they wanted to upload, and then extracted image 'hashes' from the database to compare them all, and see what had matching rounded rgb's.
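A rough sketch of that process, not the original code: the function names are made up, and the comparison simply counts how many of the rounded pixels are identical in both signatures. The GD helper assumes the image has already been loaded, e.g. with imagecreatefromjpeg().

```php
<?php
// Round one channel to the nearest multiple of $step (50 in the answer).
function roundChannel(int $value, int $step = 50): int
{
    return (int) (round($value / $step) * $step);
}

// Shrink a GD image to $size x $size and return its pixels as [r, g, b] triples.
function pixelsFromImage($image, int $size = 10): array
{
    $small = imagecreatetruecolor($size, $size);
    imagecopyresampled($small, $image, 0, 0, 0, 0, $size, $size, imagesx($image), imagesy($image));
    $pixels = [];
    for ($y = 0; $y < $size; $y++) {
        for ($x = 0; $x < $size; $x++) {
            $rgb = imagecolorat($small, $x, $y);
            $pixels[] = [($rgb >> 16) & 0xFF, ($rgb >> 8) & 0xFF, $rgb & 0xFF];
        }
    }
    return $pixels;
}

// Build the storable signature: rounded pixels, serialized, then base64-encoded.
function signatureFromPixels(array $pixels, int $step = 50): string
{
    $rounded = [];
    foreach ($pixels as [$r, $g, $b]) {
        $rounded[] = [roundChannel($r, $step), roundChannel($g, $step), roundChannel($b, $step)];
    }
    return base64_encode(serialize($rounded));
}

// Count how many pixel positions carry the same rounded colour in both signatures.
function matchingPixels(string $signatureA, string $signatureB): int
{
    $a = unserialize(base64_decode($signatureA), ['allowed_classes' => false]);
    $b = unserialize(base64_decode($signatureB), ['allowed_classes' => false]);
    $matches = 0;
    foreach ($a as $i => $rgb) {
        if (isset($b[$i]) && $b[$i] === $rgb) {
            $matches++;
        }
    }
    return $matches;
}
```

How many matching pixels out of 100 should count as "similar" is a judgment call and is not specified in the answer.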
Tips
The bigger that 50 is in the RGB rounding, the less specific your search will be (and vice versa).
If you want your SQL to be more specific, it may be better to store extra/specific info about the image in the database, so that you can limit the rows you pull from the database. E.g. if the aspect ratio is 4:3, only pull images around 4:3 from the database (etc).
It can be difficult to get the image to exactly 5px*5px, so a suggestion is phpThumb. I used it with the syntax:
phpthumb.php?src=IMAGE_NAME_HERE.png&w=10&h=10&zc=1
// &w= width of your image
// &h= height of your image
// &zc= zoom-crop. 0: keep the aspect ratio, 1: crop so the image fills your width+height
Good luck mate, hope I could help.

For an easy php implementation check out: https://github.com/kennethrapp/phasher
However, I wonder if there is a native MySQL function for the "compare" step (see the PHP class above).

I scale the image down to 8x8, then convert RGB to 1-byte HSV, so the resulting hash is a 172-byte string.
HSVHSVHSVHSVHSVHSVHSVHSV... (from 8x8 block, 172 bytes long)
0fff0f3ffff4373f346fff00...
It's not 100% accurate (some duplicates aren't found), but it works nicely and there seem to be no false positives.

Putting it in academic terms, what you are looking for is a similarity function which takes in two images and returns an indicator of how far apart or how similar the two images are. This indicator could easily be a decimal number ranging from -1 to 1 (far apart to very close). Once you have this function, you can set an image as a reference and compare all the images against it. Then finding the images similar to a given one is as simple as finding the closest similarity factor to it, which is done with a simple search over a double field within an RDBMS like MySQL.
Now all that remains is how to define the similarity function. To be honest, this is problem specific. It depends on what you call similar. But covariance is usually a good starting point; it just needs your two images to be the same size, which I think is no big deal. You can also find lots of other ideas by searching for 'similarity measures between two images'.
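One way to get a covariance-based score in the -1 to 1 range is the Pearson correlation coefficient, which is the covariance divided by the product of the two standard deviations. The sketch below assumes each image has already been resized to the same dimensions and flattened into a list of grey values; the function name is hypothetical, and returning 0.0 for a perfectly flat image is an arbitrary choice.

```php
<?php
// Pearson correlation between two equally long lists of grey values.
// Returns a value from -1 (inverted) through 0 (unrelated) to 1 (same pattern).
function imageCorrelation(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);
    $n = count($a);
    if ($n === 0 || $n !== count($b)) {
        throw new InvalidArgumentException('Both images need the same, non-zero number of pixels.');
    }
    $meanA = array_sum($a) / $n;
    $meanB = array_sum($b) / $n;

    $covariance = 0.0;
    $varianceA = 0.0;
    $varianceB = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $da = $a[$i] - $meanA;
        $db = $b[$i] - $meanB;
        $covariance += $da * $db;
        $varianceA += $da * $da;
        $varianceB += $db * $db;
    }
    // A completely flat image has no variance, so the correlation is undefined.
    if ($varianceA == 0.0 || $varianceB == 0.0) {
        return 0.0;
    }
    return $covariance / sqrt($varianceA * $varianceB);
}
```

The score against a fixed reference image is the number that could then be stored in the double column mentioned above.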

How can you find the "majority colors" of an image using PHP?

How can I calculate what the majority colors of an image are in PHP? I'd prefer to group different shades of a similar color into a single bucket, so for example all shades of blue are just counted as "blue".
In other words, I'd like a function that takes an image and returns a simple array similar to:
"blue":90%, "white":10%
No need for high accuracy, just enough to categorize the images by dominant and sub-dominant colors. Thanks!

Here's one approach:
1) Define a set of colours which we'll call centroids -- these are the middle of the basic colours you want to break images into. You can do this using a clustering algorithm like k-means, for example. So now you've got, say, 100 centroids (buckets, you can think of them as), each of which is an RGB colour triple with a name you can manually attach to it.
2) To generate the histogram for a new image:
open the image in gd or whatever
convert it to an array of pixel values (e.g. using imagecolorat)
determine the distance (euclidean distance is ok) between the pixel value and all the centroids. Classify each pixel as to which bucket it's closest to.
Your output is a centroid assignment for each pixel. Or, given you just want a histogram, you can just count how many times each centroid occurs.
Bear in mind that this kind of colour assignment is somewhat subjective. I'm not sure there'll be a definitive mapping from colours to names (e.g., it's language dependent). But if you google, there might exist a look-up table that you could use, although I've not come across one.
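Step 2 could look something like the sketch below. The centroid table in the usage note is a tiny hand-picked example, not the output of k-means, and the function names are made up for illustration. Pixels are passed in as [r, g, b] triples, e.g. collected with imagecolorat().

```php
<?php
// Name of the centroid closest to one pixel (squared Euclidean distance in RGB).
function nearestCentroid(array $rgb, array $centroids): string
{
    if ($centroids === []) {
        throw new InvalidArgumentException('At least one centroid is required.');
    }
    $bestName = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($centroids as $name => [$r, $g, $b]) {
        $distance = ($rgb[0] - $r) ** 2 + ($rgb[1] - $g) ** 2 + ($rgb[2] - $b) ** 2;
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $bestName = (string) $name;
        }
    }
    return $bestName;
}

// Histogram of centroid names => share of pixels, largest share first.
function colourShares(array $pixels, array $centroids): array
{
    $counts = array_fill_keys(array_keys($centroids), 0);
    foreach ($pixels as $rgb) {
        $counts[nearestCentroid($rgb, $centroids)]++;
    }
    arsort($counts);
    $total = count($pixels);
    return array_map(fn($count) => $total > 0 ? $count / $total : 0.0, $counts);
}
```

For example, with $centroids = ['blue' => [0, 0, 255], 'white' => [255, 255, 255], 'black' => [0, 0, 0]], an image that is mostly dark blue with a little near-white would come back with 'blue' first and 'white' second.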
Hope this helps!
Ben

Is there a way to find images of a certain color from specified sites?

First off, I don't mean google image search!
I would like to give users the ability to select a hex color value and then have a search programmatically return (from specified sites/directories online) images where the dominant color is the color they specified (or close to it).
Is there a technology that can do this? I'd prefer PHP/MySQL, but I'd be willing to use other languages if it would be simpler.
EDIT
Taking several suggestions, I managed to find this: http://www.coolphptools.com/color_extract which does a decent job at extracting the most common colors from the image.
The next step is calculating distance from the extracted colors to the color being searched for. I have no issue implementing it except I'm unclear on the best way to calculate the color distance?
I've scoured this site and google for a concrete answer, but come up dry. The tool above extracts colors into hex color codes. I am currently converting this to RGB and using those.
Should I attempt to convert RGB to Y'UV? I'm attempting that by using:
sqrt(((r - r1) * .299)^2 + ((g - g1) * .587)^2 + ((b - b1) * .114)^2)
(based on an answer here: RGB to closest predefined color)
It's not very accurate. What should I swap that color distance formula with so it calculates accurate color distance (to the human eye)?
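One thing worth ruling out if that formula is typed into PHP as written: in PHP, ^ is bitwise XOR, not exponentiation, which could by itself produce odd distances. A minimal sketch of the same weighted distance using ** instead (the function name is hypothetical, and this does not by itself make the measure perceptually accurate):

```php
<?php
// Weighted RGB distance from the question, with ** for squaring.
// $a and $b are [r, g, b] triples with channel values from 0 to 255.
function weightedRgbDistance(array $a, array $b): float
{
    return sqrt(
        (($a[0] - $b[0]) * 0.299) ** 2 +
        (($a[1] - $b[1]) * 0.587) ** 2 +
        (($a[2] - $b[2]) * 0.114) ** 2
    );
}
```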

Interesting.
The first problem is: "What is the dominant colour of an image?" Maybe the one most pixels have. What do you do with similar shades of the same colour? Would you cluster around similar colours?
I would implement it this way:
Grab all images inside your search paths. Cluster the colors used in each of them and the biggest cluster is the dominant color. You will have to play around a bit with cluster sizes and number of clusters. If this color is within a certain range of hue, saturation and brightness of your searched color it is a match.
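The final check, whether a dominant colour falls within a certain range of hue, saturation and brightness of the searched colour, might be sketched like this. The tolerance values are placeholders to experiment with, and the function names are made up; finding the dominant colour itself (the clustering step) is not shown.

```php
<?php
// Convert 0-255 RGB to [hue in degrees 0-360, saturation 0-1, value 0-1].
function rgbToHsv(int $r, int $g, int $b): array
{
    $rf = $r / 255;
    $gf = $g / 255;
    $bf = $b / 255;
    $max = max($rf, $gf, $bf);
    $min = min($rf, $gf, $bf);
    $delta = $max - $min;

    $hue = 0.0;
    if ($delta > 0) {
        if ($max == $rf) {
            $hue = 60 * fmod(($gf - $bf) / $delta, 6);
        } elseif ($max == $gf) {
            $hue = 60 * (($bf - $rf) / $delta + 2);
        } else {
            $hue = 60 * (($rf - $gf) / $delta + 4);
        }
        if ($hue < 0) {
            $hue += 360;
        }
    }
    $saturation = $max > 0 ? $delta / $max : 0.0;
    return [$hue, $saturation, $max];
}

// True when $candidate is within the given tolerances of $target (both [r, g, b]).
function isColourMatch(array $candidate, array $target, float $hueTol = 20.0, float $satTol = 0.3, float $valTol = 0.3): bool
{
    [$h1, $s1, $v1] = rgbToHsv(...$candidate);
    [$h2, $s2, $v2] = rgbToHsv(...$target);
    // Hue is circular: 350 degrees and 10 degrees are only 20 degrees apart.
    $hueDiff = abs($h1 - $h2);
    $hueDiff = min($hueDiff, 360 - $hueDiff);
    return $hueDiff <= $hueTol && abs($s1 - $s2) <= $satTol && abs($v1 - $v2) <= $valTol;
}
```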

Firstly, I wonder how you can crawl sites/directories to search for a particular image color unless you have a big list of websites. If that isn't related to your question, just ignore it.
Back to your question: I personally think it is an interesting one as well. Since it requires quite a bit of research, I just want to point out some ideas for you to look into.
What you need to do is take the user-specified hex colors and convert them into RGB, because most of the image functions in PHP that I know of only work with RGB. Now, if you have a list of directories that you can search, just crawl them and use some basic functions to get hold of the desired webpage's contents (e.g. file_get_contents or cURL). Once you have the contents of a specific page, you will need to use DOM functions to get the images' URLs from that page (you can work it out yourself, using getElementsByTagName() and getAttribute()). Assuming you now hold a list of image URLs, you need to get their colors and try to match them with your user-specified colors (remember to convert everything into RGB).
In PHP we have the very convenient GD library for working with images. If your server supports GD2, you can have a look at imagecolorclosest(). This function "Returns the index of the color in the palette of the image which is "closest" to the specified RGB value". Note that the function only returns the closest match (not an exact match), so you have to do some comparisons to choose the right images (I believe this is easy because you now have RGB colors with very handy values to work with, say, using some subtraction and adjustment method).
Moreover, you don't have to look only at the images. When you have the content of a specific page, you can also try to find the color scheme of that page (by getting its "background-color" value); there are quite a few details that you can get and play around with :) Of course, an image's color is often related to the colors of its page's styling, so think about the problem more broadly.
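The hex conversion and the DOM step could be sketched as below. Both function names are hypothetical, the DOM part assumes PHP's DOM extension is available, and fetching the page (file_get_contents or cURL) is left out.

```php
<?php
// Convert "#ff8000", "ff8000" or the short form "#f80" to [r, g, b].
function hexToRgb(string $hex): array
{
    $hex = ltrim(trim($hex), '#');
    if (strlen($hex) === 3) {
        $hex = $hex[0] . $hex[0] . $hex[1] . $hex[1] . $hex[2] . $hex[2];
    }
    if (!preg_match('/^[0-9a-fA-F]{6}$/', $hex)) {
        throw new InvalidArgumentException('Not a valid hex colour: ' . $hex);
    }
    return [
        hexdec(substr($hex, 0, 2)),
        hexdec(substr($hex, 2, 2)),
        hexdec(substr($hex, 4, 2)),
    ];
}

// Collect the src attribute of every <img> element in an HTML string.
function imageUrlsFromHtml(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $document = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml warnings quiet while parsing.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($document->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return $urls;
}
```

Relative URLs returned by imageUrlsFromHtml() would still need to be resolved against the page's address before they can be downloaded.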
If I'm saying something not clear, don't hesitate to comment on my reply :)
Happy coding.

Cleaning pixels from a map

We have this map, and we need to use PHP to take all the shades of blue out, as well as the percentages. The problem is that some of the percentages are in the same color as the borders, and other times the percentages run into the border. We need to use this image.

There are not (AFAIK) really easy ways.
The easiest way doesn't give you good results: separate channels and delete small components.
The result is this:
As you can see there are a few numbers and percent signs remaining because they are connected to the delimiting lines and deleting small components doesn't work for them.
If you need a better job, you should correlate the image with a template of each number, and once identified, delete it.
Here you can see the result of the correlation with the number "2":
One wrong "2" is identified, (see top left), so a more sophisticated approach may be needed for a general procedure.
Anyway, I think this kind of manipulation is well beyond what you can expect from K-12.
HTH!
Edit
As per your request, some detail about the first method.
You first separate the three channels, and get three images:
You keep the third (the blue channel)
Then you need to delete the smaller components. There are a lot of ways to do that; probably the easiest is derived from the connectivity detection used, for example, in the flood-fill algorithm, except that you just measure the components without filling them.
The basic (not optimized) idea is to visit every pixel in the image and count how many pixels are "connected" to it. If there are fewer than a specified number (threshold), you just delete the whole set. Most image manipulation libraries have all these functions already implemented.
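As an illustration of that idea, here is an unoptimized sketch in plain PHP. It assumes the blue channel has already been thresholded into a 2D array where 1 marks a dark (foreground) pixel and 0 marks background, and it uses 4-connectivity; the function name and the input format are assumptions for the example.

```php
<?php
// Remove every connected group of 1-pixels that has fewer than $minSize pixels.
// $grid is a rectangular array of rows, each row an array of 0/1 integers.
function removeSmallComponents(array $grid, int $minSize): array
{
    $height = count($grid);
    $width = $height > 0 ? count($grid[0]) : 0;
    $visited = [];

    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            if ($grid[$y][$x] !== 1 || isset($visited[$y][$x])) {
                continue;
            }
            // Flood-fill style traversal that only measures the component.
            $stack = [[$x, $y]];
            $visited[$y][$x] = true;
            $component = [];
            while ($stack) {
                [$cx, $cy] = array_pop($stack);
                $component[] = [$cx, $cy];
                foreach ([[1, 0], [-1, 0], [0, 1], [0, -1]] as [$dx, $dy]) {
                    $nx = $cx + $dx;
                    $ny = $cy + $dy;
                    if ($nx < 0 || $ny < 0 || $nx >= $width || $ny >= $height) {
                        continue;
                    }
                    if ($grid[$ny][$nx] !== 1 || isset($visited[$ny][$nx])) {
                        continue;
                    }
                    $visited[$ny][$nx] = true;
                    $stack[] = [$nx, $ny];
                }
            }
            // Too small to be a border line: erase the whole set.
            if (count($component) < $minSize) {
                foreach ($component as [$cx, $cy]) {
                    $grid[$cy][$cx] = 0;
                }
            }
        }
    }
    return $grid;
}
```

As the answer notes, digits that touch a border line are part of the border's component and survive this pass.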

For this specific image, if you open the image in image editing software, convert the mode from index to true color (RGB), and then color dodge the entire image with yellow (RGB: 255,255,0), you wind up with a black and white image consisting of the outlines and numbers. (this is also what the blue channel looks like BTW)
So either throw away the red and green channels, or implement a color dodge algorithm.
Another alternative is to sample each pixel, and then set that pixel's R & G components to the B value.
edit: actually, I forgot about the white numbers. To get those, flood fill the outer white with rgb(0,0,255), invert the entire image, and color dodge with (255,255,0); the red or green channel is now the missing numbers. Overlay these on top of the processed image from the previous steps above.

Getting rid of the shaded colors should be easy.
Getting rid of the numbers is more tricky. I would:
Make a lookup table of the pixel data associated with each number and the % sign.
When clearing an area, look for the numbers (black or white) and only clear out exact patterns from the lookup table.
Recreate the border between areas by adding a black color between different shades.
It's impossible to do this with guaranteed accuracy simply because the digits hide original information. However, I think the above steps would give you close to 100% accuracy without a lot of effort.

Image comparison with php + gd

What's the best approach to comparing two images with php and the Graphic Draw (GD) Library?
This is the scenario:
I have an image, and I want to find which image of a given set is the most similar to it.
The most similar image is in fact the same image, not pixel perfect match but the same image.
I've dramatised the difference between the two images with the number one on the example just to ease the understanding of what I meant.
Even though it brought no consistent results, my approach was to reduce the images to 1px using the imagecopyresampled function and see how close the RGB values were between images.
I subtracted each red, green and blue decimal value from the corresponding value of the possible match and summed the results, which gave me a dissimilarity index. It didn't work as expected, since the most RGB-similar image was not always the target image, but I could use it to select an image from the available targets.
Here's a sample of the output when comparing 4 images against a target image, in this case the apple logo, that matches one of them but is not exactly the same:
Original image:
Red:222 Green:226 Blue:232
Compared against:
http://a1.twimg.com/profile_images/571171388/logo-twitter_normal.png
Red:183 Green:212 Blue:212 and an index of dissimilarity 56
Red:117 Green:028 Blue:028 and an index of dissimilarity 530
Red:218 Green:221 Blue:221 and an index of dissimilarity 13 Matched Correctly.
Red:061 Green:063 Blue:063 and an index of dissimilarity 491
It may not even be possible to get better results than what I'm already getting, and I may be wasting my time here, but since there seem to be a lot of experienced PHP programmers here, I guess you can point me in the right direction on how to improve this.
I'm open to other image libraries such as iMagick, Gmagick or Cairo for php but I'd prefer to avoid using other languages than php.
Thanks in advance.

I'd have thought your approach seems reasonable, but reducing an entire image to 1x1 pixel in size is probably a step too far.
However, if you converted each image to the same size and then computed the average colour in each 16x16 (or 32x32, 64x64, etc. depending on how much processing time/power you wish to use) cell you should be able to form some kind of sensible(-ish) comparison.
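A sketch of the cell-average idea in plain PHP, assuming each image has already been read into a 2D array of grey (or single-channel) values. The function names are made up, and using the mean absolute difference as the comparison is just one possible choice.

```php
<?php
// Average the values of a 2D array inside a $cells x $cells grid of cells.
// Returns a flat list of $cells * $cells averages, row by row.
function cellAverages(array $pixels, int $cells = 16): array
{
    $height = count($pixels);
    $width = $height > 0 ? count($pixels[0]) : 0;
    $sums = array_fill(0, $cells * $cells, 0);
    $counts = array_fill(0, $cells * $cells, 0);

    foreach (array_values($pixels) as $y => $row) {
        foreach (array_values($row) as $x => $value) {
            // Map the pixel position to the cell it falls into.
            $index = intdiv($y * $cells, $height) * $cells + intdiv($x * $cells, $width);
            $sums[$index] += $value;
            $counts[$index]++;
        }
    }
    $averages = [];
    foreach ($sums as $index => $sum) {
        $averages[$index] = $counts[$index] > 0 ? $sum / $counts[$index] : 0.0;
    }
    return $averages;
}

// Mean absolute difference between two equally long lists of cell averages.
// 0 means identical; larger values mean less similar.
function meanAbsoluteDifference(array $a, array $b): float
{
    $n = count($a);
    if ($n === 0 || $n !== count($b)) {
        throw new InvalidArgumentException('Both lists need the same, non-zero length.');
    }
    $total = 0.0;
    foreach ($a as $i => $value) {
        $total += abs($value - $b[$i]);
    }
    return $total / $n;
}
```

Because the cells are laid out in proportion to each image's own width and height, the two images do not strictly need to be resized first, although their aspect ratios should be comparable.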

I would suggest, like middaparka, that you do not downsample to a 1-pixel image, because you lose all the spatial information. Downsampling to 16x16 (or 32x32, etc.) would certainly provide better results.
Then it also depends on whether color information is important to you or not. From what I understand, you could actually do without it: compute a gray-level image from your color image (e.g. luma) and then compute the cross-correlation. If, as you said, there is a pair of images that match exactly (except for color information), this should give you pretty good reliability.

I used the ideas of scaling, downsampling and gray-level conversion mentioned in the question and answers to apply a Mean Squared Error between the pixel channel values of 2 images, using the GD library.
The code is in this answer, including a test with those ideas.
I also did some benchmarking, and I think the downsampling may not be needed for such small images, because the method is fast (for PHP), taking just a fraction of a second.
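The linked code is not reproduced here, so the following is only a generic sketch of a mean squared error over channel values, not that answer's implementation. It assumes both images have been scaled to the same size and read into lists of [r, g, b] triples; the function name is hypothetical.

```php
<?php
// Mean squared error over all channel values of two equally sized images.
// 0 means identical; the maximum (black vs. white) is 255 * 255 = 65025.
function meanSquaredError(array $pixelsA, array $pixelsB): float
{
    $n = count($pixelsA);
    if ($n === 0 || $n !== count($pixelsB)) {
        throw new InvalidArgumentException('Both images need the same, non-zero number of pixels.');
    }
    $sum = 0.0;
    foreach ($pixelsA as $i => $a) {
        $b = $pixelsB[$i];
        for ($channel = 0; $channel < 3; $channel++) {
            $difference = $a[$channel] - $b[$channel];
            $sum += $difference * $difference;
        }
    }
    return $sum / ($n * 3);
}
```

The candidate with the lowest error against the target would be picked as the match.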

Using middaparka's method, you can transform each image into a sequence of numeric values and then use the Levenshtein algorithm to find the closest match.
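A sketch of how that could work with PHP's built-in levenshtein(), which compares strings: each value is quantised into one of 16 letters so a sequence becomes a short string. The 16 levels and the function names are assumptions for the example. Before PHP 8.0, levenshtein() only accepted strings up to 255 characters, so a 16x16 grid (256 values) would be one character too long there.

```php
<?php
// Turn a list of 0-255 values into a string of letters 'A'..'P' (16 levels).
function sequenceToString(array $values, int $levels = 16): string
{
    $string = '';
    foreach ($values as $value) {
        $bucket = intdiv((int) $value * $levels, 256);
        $string .= chr(ord('A') + $bucket);
    }
    return $string;
}

// Return the key of the candidate whose sequence has the smallest edit distance
// to the target sequence, or null when there are no candidates.
function closestMatch(array $target, array $candidates): ?string
{
    $targetString = sequenceToString($target);
    $bestKey = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $key => $values) {
        $distance = levenshtein($targetString, sequenceToString($values));
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $bestKey = (string) $key;
        }
    }
    return $bestKey;
}
```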
