Is there any way to get a unique hash from GIF images? I tried PHP's
sha1_file
function, but I don't know whether two different GIF images can ever produce the same hash.
Can that happen with SHA-1? If so, would SHA-2 or MD5 be better, or any other hash already implemented in PHP?
I know it also depends on file size, but my GIFs never exceed 10 MB in any case.
I'd appreciate any recommendations. Best regards.
There is no hash function that creates different values for each and every set of images you provide. This should be obvious, as your hash values are much shorter than the files themselves and are therefore bound to drop some information along the way. Given a fixed set of images it is rather simple to produce a perfect hash function (e.g. by numbering them), but this is probably not the answer you are looking for.
On the other hand, you could use "perfect hashing", a two-step hashing scheme that guarantees amortized O(1) access, but as you are asking for a unique hash that may also not be what you are looking for. Could you be a bit more specific about why you insist on the hash value being unique, and under what circumstances?
sha1_file is fine.
In theory you can run into two files that hash to the same value, but in practice it is so stupendously unlikely that you should not worry about it.
Hash functions don't provide any guarantees about uniqueness. Patru explains why very well: this is the pigeonhole principle, if you'd like to read up.
I'd like to talk about another aspect, though. While you won't get any theoretical guarantees, you get a practical guarantee. Consider this: SHA-256 generates hashes that are 256 bits long, so there are 2^256 possible hashes it can generate. Assume further that the hashes it generates are distributed almost purely randomly (true for SHA-256). That means that if you generate a billion hashes a second, 24 hours a day, you'll have generated 31,536,000,000,000,000 hashes a year. A lot, right?
Now divide 2^256 by that number. The result is ~10^60. If you walked linearly through all possible hashes, that's how many years it would take you to generate them all (pack a lunch). Divide that by two, and it's... still ~10^60. That's how many years you'd have to work to have a greater than 50% chance of generating the same hash twice.
To put it another way, if you generate a billion hashes a second for a century, you'd have about a 1 in 10^58 chance of generating the same hash twice. Until the sun burns out, about 1 in 10^50.
Those are damn fine chances.
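For a rough sanity check of those magnitudes in PHP itself (floats are fine for order-of-magnitude work, even though the exact integers overflow):

```php
<?php
// Order-of-magnitude check of the numbers above.
$space   = pow(2, 256);            // ~1.16e77 possible SHA-256 outputs
$perYear = 1e9 * 365 * 24 * 3600;  // hashes per year at a billion per second
$years   = $space / $perYear;      // ~3.7e60 years to enumerate them all
printf("%.1e\n", $years);
```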
Related
The initial task is to process an image, hash it, do some heavy image work, and store the hash and the results in a database.
On the next request with the same image, I want to compare the image's hash with the hashes in the database and load the cached results, to reduce the amount of heavy work.
So the questions are: what to hash, and with what to hash?
I see good PHP implementations of pHash, but it seems to be great for similarity checks, while we need exact matching.
Is pHash fine for exact matching as well?
Thank you!
PHP provides a built-in function for this, which is probably the easiest solution:
$hash = hash_file("sha1", '/path/to/image');
You can use this to check for exact matches. There is a small chance of collisions, but you can help mitigate it by also including the file path or a database ID in your comparison.
The answers in this similar question provide more options.
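To make the flow concrete, here is a minimal sketch of the hash-then-cache idea, with a plain array standing in for the database table and a hypothetical processImage() standing in for the heavy work:

```php
<?php
// Hypothetical stand-in for the expensive image processing.
function processImage($path) {
    return 'result-for-' . basename($path);
}

// Return a cached result if this exact file content was seen before,
// otherwise do the heavy work once and cache it by content hash.
function getCachedResult($path, array &$cache) {
    $hash = hash_file('sha1', $path);
    if (array_key_exists($hash, $cache)) {
        return $cache[$hash];            // exact match: reuse
    }
    return $cache[$hash] = processImage($path);
}
```

In a real application $cache would be a database table keyed on the hash column, ideally with a unique index on it.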
This question already has answers here:
Is "double hashing" a password less secure than just hashing it once?
(16 answers)
Closed 9 years ago.
I've read many posts on SO about how you should implement password hashing, and I've read that you shouldn't hash the password many times (it doesn't help much, it is said). But why not? What if I iterate the hash, say, 10,000,000 times (the user can wait 3 seconds for registration to complete, or I could do it from an AJAX request)?
So how could an attacker who has stolen my database, and who even knows that I iterate the hash 10,000,000 times (worst-case scenario), possibly recover users' passwords? He couldn't build a rainbow table, as it would take far too long (hashing a password takes time, and hashing the hash that many times takes much more), and brute force isn't really feasible either, so what's left?
@evening: I wasn't saying anything about bcrypt or PBKDF.
Your question implicitly screams "I am trying to kludge my way around having to use bcrypt/PBKDF by poorly imitating their methods." However, the problems raised in the duplicate question are the reason these newer algorithms were devised instead of simply re-hashing a key X times.
You want a simple answer? Then: yes, X+1 hashing rounds are more secure than X hashing rounds, but only marginally so. You might spend a second or two computing the hash on your server by looping over $hash = hash('sha512', $hash); but an attacker is going to use a slide attack to cut that down to a fraction of the time, and on top of that they're likely going to parallelize the attack across a few AWS instances, a farm of graphics cards, or a botnet.
The methods that PBKDF and bcrypt employ go quite a way towards minimizing or negating the effect of the slide attack, and bcrypt does some magic voodoo that, to some extent, prevents it from being parallelized.
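To illustrate the contrast being discussed, here is a sketch of the naive re-hash loop next to PHP's built-in password_hash() (bcrypt, available since PHP 5.5), whose cost parameter raises the work factor without any hand-rolled looping. The function name naiveStretch is hypothetical:

```php
<?php
// The naive scheme criticized above: plain iterated hashing, which a
// slide attack can shortcut and which parallelizes cheaply.
function naiveStretch($password, $rounds) {
    $hash = hash('sha512', $password);
    for ($i = 1; $i < $rounds; $i++) {
        $hash = hash('sha512', $hash);
    }
    return $hash;
}

// The preferred route: bcrypt with a tunable cost (work) factor,
// which can be raised as hardware gets faster.
$stored = password_hash('correct horse', PASSWORD_BCRYPT, ['cost' => 10]);
$ok     = password_verify('correct horse', $stored);  // true
```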
Because of the slide attack, which is independent of the number of cipher/hash rounds. See:
http://en.wikipedia.org/wiki/Slide_attack
Because of the way MD5 encodes its input, it always outputs a string of the same length (32 characters, for instance). In essence this means that the string "I am a string" can potentially have the same hash as the string "Whooptydoo"; although the chance is vanishingly small, it is still a chance.
This also means that repeatedly calculating the hash of your string X times doesn't change the probability of it being cracked, as it doesn't encode it any more deeply than it already was.
I hope I explained that clearly enough; please comment if I've missed something.
I'm assuming you are talking about hashing the password, and then hashing the hash, hashing THAT hash, etc. You could easily then create a rainbow table that just maps hash values to the hash-of-hash values.
So, if HASH(PASSWORD) = H, and then HASH(H) = H1, HASH(H1) = H2, and so forth, then an easily downloadable rainbow table could contain a list like PASSWORD | H and reverse-look-up the password that way. Now just have, in addition, a table that looks like H | H10000 and reverse that as well.
So you've inconvenienced a would-be hacker, maybe. But it's not really "more secure", because it's just a longer road, not a more treacherous or difficult one.
I want to use a unique ID generated by PHP in a database table that will likely never have more than 10,000 records. I don't want the time of creation to be visible, or to use a purely numeric value, so I am using:
sha1(uniqid(mt_rand(), true))
Is it wrong to use a hash for a unique ID? Don't all hashes lead to collisions or are the chances so remote that they should not be considered in this case?
A further point: if the number of characters to be hashed is less than the number of characters in a sha1 hash, won't it always be unique?
If you have 2 keys you have a theoretical best-case probability of a collision of 1 in 2^X, where X is the number of bits in your hashing algorithm. 'Best case' because the input will usually be ASCII, which doesn't utilize the full charset, and because hashing functions do not distribute perfectly, so in real life they collide more often than the theoretical maximum.
To answer your final question:
A further point: if the number of characters to be hashed is less than the number of characters in a sha1 hash, won't it always be unique?
Yeah, that's true, sorta. But then you would have another problem: generating unique keys of that size. The easiest way is usually a checksum, so just choose a large enough digest that the collision space will be small enough for your comfort.
As @wayne suggests, a popular approach is to concatenate microtime() with your random salt (and base64_encode the result to raise the entropy).
How horrible would it be if two ended up the same? Murphy's Law applies: if a million-to-one, or even a 100,000:1, chance is acceptable, then go right ahead! The real chance is much, much smaller, but if your system will explode when it happens, then that design flaw must be addressed first. Then proceed with confidence.
Here is a question/answer of what the probabilities really are: Probability of SHA1 Collisions
Use sha1(time()) instead; then you remove the random possibility of a repeating hash for as long as time can be represented in fewer characters than the sha1 hash (likely longer than you will find a working PHP parser ;)).
Computer randomness isn't actually random, you know?
The only true randomness you can obtain from a computer, supposing you are in a Unix environment, is from /dev/random, but this is a blocking operation that depends on user interactions such as moving the mouse or typing on the keyboard. Reading from /dev/urandom is less safe, but it's probably better than using just ASCII characters and gives you an instantaneous response.
sha1($ipAddress.time())
Because it's impossible for anyone else to use the same IP address at the same time.
I am making a classified ads site with Zend Framework (for portfolio purposes, yes I know the world doesn't have room for "yet another Craigslist clone"). I am trying to implement the ability to post/edit/delete without ever needing an account.
To do this, I feel like I need to have a Nonce generated upon post submission and stored in the database. Then email a link to the user which makes a GET request for the delete, like this:
http://www.somesite.com/post/delete/?id=123&nonce=2JDXS93JFKS8204HJTHSLDH230945HSLDF
Only the user has this unique key or nonce, and upon submission I check the database under the post's ID and ensure the nonce matches prior to deleting.
My issue is how secure the nonce actually is. If I use Zend Framework's Zend_Form_Element_Hash, it creates the hash like this:
protected function _generateHash()
{
    $this->_hash = md5(
        mt_rand(1, 1000000)
        . $this->getSalt()
        . $this->getName()
        . mt_rand(1, 1000000)
    );
    $this->setValue($this->_hash);
}
In reading about mt_rand(), one commenter said: "This function has limited entropy. So, if you want to create a random string, it will produce only about 2 billion different strings, no matter the length of the string. This can be a serious security issue if you are using such strings for session identifiers, passwords, etc."
Due to the lifetime of the nonce/token in the application, which could be days or weeks before the user chooses to delete a post, I think a potential attacker would have more than enough time.
I realize mt_rand() is a huge upgrade from rand(), as seen in this visual mapping of pixels with rand on the left and mt_rand on the right. But is it enough? What makes "2 billion different strings" a security issue?
And ultimately, how can I increase the entropy of a nonce/token/hash?
For this kind of security it's not only the length of your output that matters; it also counts how much randomness you've used to create it.
For mt_rand() the source of randomness is its seed and state (the number of times you've used it since it was seeded). More mt_rand() calls just give you more rehashing of the same randomness source (no new entropy).
mt_rand()'s seed is only 32 bits (anything less than 128 bits makes cryptographers suspicious ;))
The strength of keys with 32 bits of entropy is 4 billion divided by (roughly) the number of keys you'll generate (e.g. after 100K uses there will be a ~1:43,000 chance to guess any valid key, which approaches practical brute-forcing).
You're adding a salt to this, which makes it much stronger, because in addition to guessing the seed an attacker would also have to know the salt; if the salt is long, the overall key may be quite strong despite the "low" entropy.
To increase entropy you need to add more random material (even slightly random is OK, it just gives fewer bits) from sources other than mt_rand(): microtime(), the amount of memory used, the process ID... or just use /dev/random, which collects all the entropy it can get.
(edit: uniqid() has weak entropy, so it won't help here)
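On PHP 7 and later, the simplest way to follow this advice is random_bytes(), which reads from the operating system's CSPRNG (e.g. /dev/urandom or getrandom()), so the full width of the token is real entropy rather than a stretched 32-bit seed:

```php
<?php
// 16 random bytes = 128 bits of entropy, hex-encoded to 32 characters.
$nonce = bin2hex(random_bytes(16));
```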
The input to the md5() call in the Zend hash-generating code above has 1,000,000 × 1,000,000 different possibilities. md5() itself has 16^32 (about 3.4 × 10^38) possible outputs no matter what the input is. On average, an attacker would need to send 500,000,000,000 requests to your server in order to guess the right nonce.
At 100 requests per minute, that's about 3,472,222 days to hack.
Suppose you wanted to make a file hosting site where people upload their files and send a link to their friends to retrieve them later, and you want to ensure files aren't duplicated in storage; is PHP's sha1_file good enough for the task? Is there any reason to use md5_file instead?
For the frontend, the file will be obscured using the original file name stored in a database, but an additional concern is whether this could reveal anything about the original poster. Does a file inherit any metadata, like last-modified time or who posted it, or is that stored in the file system?
Also, is using a salt frivolous, since security against rainbow-table attacks means nothing here and the hash could later be used as a checksum?
One last thing: scalability? Initially it's only going to be used for small files a couple of megabytes big, but eventually...
Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.
Is sha1_file good enough?
Using sha1_file is mostly enough; there is a very small chance of collision, but it will almost never happen in practice. To reduce the chance to almost zero, compare file sizes too:
function is_duplicate_file($file1, $file2)
{
    if (filesize($file1) !== filesize($file2)) return false;
    if (sha1_file($file1) === sha1_file($file2)) return true;
    return false;
}
md5 is faster than sha1, but it generates a less unique output; still, the chance of collision when using md5 is very small.
Scalability?
There are several methods to compare files; which one to use depends on your performance concerns. I made a small test of the different methods:
1- Direct file compare:
if( file_get_contents($file1) != file_get_contents($file2) )
2- Sha1_file
if( sha1_file($file1) != sha1_file($file2) )
3- md5_file
if( md5_file($file1) != md5_file($file2) )
The results:
Two files of 1.2 MB each were compared 100 times, with the following results:
--------------------------------------------------------
method               time(s)    peak memory
--------------------------------------------------------
file_get_contents    0.5        2,721,576
sha1_file            1.86       142,960
md5_file             1.6        142,848
file_get_contents was the fastest, about 3.7 times faster than sha1_file, but it is not memory efficient.
sha1_file and md5_file are memory efficient; they used about 5% of the memory used by file_get_contents.
md5_file might be a better option because it is a little faster than sha1_file.
So the conclusion is that it depends on whether you want a faster compare or lower memory usage.
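Note that pairwise comparison stops scaling once there are many stored files. A common pattern, sketched here with an array standing in for a database index over a size-plus-hash column, is to look duplicates up by key instead of comparing file contents directly:

```php
<?php
// Deduplicate uploads by keying stored files on (size, sha1) rather
// than comparing every pair. $index stands in for a database table.
function storeOrReuse($path, array &$index) {
    $key = filesize($path) . ':' . sha1_file($path);
    if (isset($index[$key])) {
        return $index[$key];     // duplicate content: reuse existing file
    }
    return $index[$key] = $path; // new content: record this copy
}
```

With an index on that key, each upload costs one hash computation and one O(1) lookup, instead of a comparison against every stored file.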
As per my comment on @ykaganovich's answer, SHA1 is (surprisingly) slightly faster than MD5.
From your description of the problem, you are not trying to create a secure hash, merely to hide the file in a large namespace, in which case a salt and rainbow tables are irrelevant: the only consideration is the likelihood of a false collision (where 2 different files give the same hash). The probability of this happening with md5 is very, very remote, and even more remote with sha1. However, you do need to think about what happens when 2 independent users upload the same warez to your site. Who owns the file?
In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
SHA should do just fine in any "normal" environment, although this is what Ben Lynn, the author of "Git Magic", has to say:
A.1. SHA1 Weaknesses
As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository. Hopefully Git will migrate to a better hash function before further research destroys SHA1.
You can always use SHA256, or other hashes which are even longer. Finding an MD5 collision is easier than finding one for SHA1.
Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use a salt to prevent plaintext/rainbow attacks in the case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use the hash as a checksum as long as you know the salt.
With respect to scalability, I'd guess that you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum would add big overhead, especially if you do it on the stream as it's being uploaded.