Avoiding hash collisions in PHP when using sha1 for hashing

Suppose a hash collision occurs while I am using the sha1() function in PHP. Will this code avoid it permanently, or do I have to use another approach?
$filename = sha1($filename.'|'.microtime());
OR
$filename = sha1($filename.'|'.rand());
If not, and this code doesn't provide protection from hash collisions:
What should I do to avoid any kind of hash collision, assuming there can be more than 100,000 entries in the DB?

It's very unlikely that a hash collision will happen with sha1.
The probability of a sha1 collision is negligible, and the risk is not a practical concern; no one has published a sha1 collision yet, so you are safe to use it.
Using a salt like microtime() or a random number may reduce the probability slightly, but you simply can't avoid collisions entirely.
And what you are computing is sha1(string) either way, whether the string is a concatenated value or a single filename, so mixing in microtime() or rand() doesn't change the collision probability of the hash function itself.
The collision probability of sha1(mixedvalue) might even be equal to or greater than that of sha1(filename), so the extra input is of no real use.
So don't worry: use this, or a simpler approach if you like; it won't create problems in the future. Worrying about hash collisions is a waste of time when the chances are so very, very small.

Just to be clear, you can't completely avoid hash collisions: there is an infinite number of inputs mapping to a finite number of outputs. But you can mix in things like the file's size, the current system time and other data as a salt, which will increase the entropy of your message digests.
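A minimal sketch of that idea (the helper name `makeDigest` and the inputs are illustrative, not a library function; salting like this makes accidental repeats less likely but does not prevent collisions):

```php
<?php
// Hypothetical helper: derive a filename digest from several inputs.
// Mixing in the file size and current time increases the entropy of
// the input, but cannot eliminate collisions in principle.
function makeDigest($filename, $filesize)
{
    $salt = $filesize . '|' . microtime(true);
    return sha1($filename . '|' . $salt);
}

echo makeDigest('photo.png', 1048576); // 40-char hex digest
```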

Just sha1() the entire file path, not only the file name.
A filename like xy.png can occur only once per directory, so your hash will be unique for that path.
This also has the advantage that you will not store duplicate files (whereas with rand()/microtime() you can end up with the same file 10 times in the same directory, which can cause problems if it's a 1 GB file).
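A sketch of that approach (the path is illustrative): because sha1() is deterministic, the same path always produces the same name, so re-uploading the same file overwrites rather than duplicates.

```php
<?php
// Hashing the full path makes the digest deterministic per file:
// the same path always yields the same 40-character name.
$path = '/var/www/uploads/xy.png'; // illustrative path
$hashedName = sha1($path);

echo $hashedName;
```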

Neither of these avoids hash collisions.
Hash collisions happen because an algorithm generates a hash of a fixed size, regardless of the input.
A hash collision occurs when two different values, like "mypassword" and "dsjakfuiUIs2kh-1jlks", end up producing the same hash because of the mathematical operations performed on them.
You can't write code to prevent hash collisions; how often they happen depends on the hashing algorithm you are using.

Related

Is substr safe for password reset?

I was wondering if using substr(md5(rand()), 0, 17); would be safe enough for a password reset link. If I were to generate a longer string, would that make it any safer? Is MD5 safe at all? Or should I do $token = sha1(uniqid($username, true));?
The use of substr() or md5() is secondary to the use of rand().
The whole point of using password reset tokens is that they're unpredictable and rand() is known to be weak due to the underlying LCG model.
It would be a better idea to use the system's entropy source instead, e.g.:
$rand = openssl_random_pseudo_bytes(8); // take 8 random bytes
$token = substr(md5($rand), 0, 17);
It takes bytes from the system's random source, e.g. /dev/urandom on Linux or the corresponding system for Windows.
Note that if you don't have any particular size constraint you might as well choose a full sha1() output and take 16 random bytes.
Also, you should treat password reset tokens as if they were (temporary, time-limited) passwords when you store them in your database; I would suggest sending the above token to the user and then applying password_hash() before writing it to the database. At a later stage you check the given token (assuming it has not expired) using password_verify().
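The flow described above might be sketched like this (storage details omitted; only the hashing calls are shown):

```php
<?php
// Generate an unpredictable token from the system entropy source,
// send the plain token to the user, and store only its hash.
$rand  = openssl_random_pseudo_bytes(16);
$token = bin2hex($rand);              // this goes in the reset email

$stored = password_hash($token, PASSWORD_DEFAULT); // this goes in the DB

// Later, when the reset link is used (and the token is not expired):
$ok = password_verify($token, $stored);
var_dump($ok); // bool(true) for the genuine token
```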
For a random hash it's safe enough. Your issue would be collisions, and the first 17 characters of an MD5 of a random value should be random enough to avoid them on a light-duty project.
I would pick uniqid() with extra entropy over rand() (and maybe even mt_rand()).
I wouldn't use MD5 or SHA1 for storing your passwords, however.
I don't see a point in using substr() unless you have some sort of length constraint that makes sense. The larger the key, the better. In general, it is a good idea to use hashes in their complete form. If MD5 were deemed "secure enough" trimmed down, it would already be distributed trimmed down. The more trimming, the higher the chance of a collision.
I prefer to use GUIDs for password reset links. GUIDs are as effectively unpredictable (and secure) as an MD5 hash of a random number and they are both 128-bit values.
Make sure to use an expiration timestamp for the reset token. I usually use 24 hours.
Don't use the system clock or any derivative of it. You want to ensure that a hacker can't start a reset for another user's password, record the timestamp, and then guess the reset token to reconstruct the reset URL. So don't use values based directly on the system clock or anything else predictable.
You should also enforce a failure/max-retry count on a single password reset token, just like regular logins, to limit the number of possible attacks. If a hacker knows the userid and is trying to guess the password reset URL, you should track the number of tries against that userid as login attempts, and lock the account accordingly. At most the hacker gets 3 tries; then lock the account for an hour. In that case, a substr() of an MD5 is still pretty secure.
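A hedged sketch of the expiry and retry checks described above (the array keys, the 24-hour window, and the 3-try limit are assumptions; the stored token is shown in plain form for brevity):

```php
<?php
// Check a stored reset token: it must match, be unexpired, and the
// per-token attempt counter must stay under a small limit.
function tokenIsUsable(array $row, $givenToken, $maxTries = 3)
{
    if ($row['attempts'] >= $maxTries) {
        return false; // locked out after too many guesses
    }
    if (time() > $row['expires_at']) {
        return false; // older than the 24-hour window
    }
    // constant-time comparison to avoid timing leaks
    return hash_equals($row['token'], $givenToken);
}

$row = array(
    'token'      => 'abc123',       // illustrative stored token
    'expires_at' => time() + 86400, // 24 hours from issue
    'attempts'   => 0,
);
var_dump(tokenIsUsable($row, 'abc123')); // bool(true)
```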

Is it wrong to use a hash for a unique ID?

I want to use a unique ID generated by PHP in a database table that will likely never have more than 10,000 records. I don't want the time of creation to be visible, nor a purely numeric value, so I am using:
sha1(uniqid(mt_rand(), true))
Is it wrong to use a hash for a unique ID? Don't all hashes lead to collisions or are the chances so remote that they should not be considered in this case?
A further point: if the number of characters to be hashed is less than the number of characters in a sha1 hash, won't it always be unique?
If you have 2 keys, you have a theoretical best-case probability of 1 in 2^X of a collision, where X is the number of bits in your hashing algorithm. "Best case" because the input will usually be ASCII, which doesn't utilize the full character set; in addition, hash functions do not distribute perfectly, so in real life they collide somewhat more often than the theoretical bound suggests.
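For the 100,000-row case from the question, the standard birthday approximation p ≈ n²/2^(b+1) can be evaluated directly. This is a back-of-the-envelope sketch, ignoring the real-world caveats above:

```php
<?php
// Birthday-bound approximation: the probability of at least one
// collision among $n random b-bit hashes is roughly n^2 / 2^(b+1).
function collisionProbability($n, $bits)
{
    return ($n * $n) / pow(2, $bits + 1);
}

// 100,000 entries hashed with sha1 (160 bits):
printf('%.3e', collisionProbability(100000, 160));
// on the order of 1e-39: negligible in practice
```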
To answer your final question:
A further point: if the number of characters to be hashed is less than
the number of characters in a sha1 hash, won't it always be unique?
Yeah, that's true, sort of. But then you would have the different problem of generating unique keys of that size yourself. The easiest way is usually a checksum, so just choose a large enough digest that the collision space is small enough for your comfort.
As @wayne suggests, a popular approach is to concatenate microtime() with your random salt (and base64_encode it to raise the entropy).
How horrible would it be if two ended up the same? Murphy's Law applies: if a million-to-one, or even a 100,000-to-one, chance is acceptable, then go right ahead! The real chance is much, much smaller; but if your system will explode when it happens, then that design flaw must be addressed first. Then proceed with confidence.
Here is a question/answer of what the probabilities really are: Probability of SHA1 Collisions
Use sha1(time()) instead; then you remove the random possibility of a repeating hash, for as long as time can be represented in fewer characters than the sha1 hash (likely longer than you will find a working PHP parser ;)).
Computer randomness isn't actually random, you know?
The only true randomness you can obtain from a computer, assuming you are in a Unix environment, is from /dev/random, but reading it is a blocking operation that depends on user interactions like moving the mouse or typing on the keyboard. Reading from /dev/urandom is less safe, but it's probably better than using just ASCII characters, and it gives you an instantaneous response.
sha1($ipAddress.time())
because it's impossible for anyone else to use the same IP address at the same time.

md5 hash of hash

This is a theoretical question, but I am curious about it. What if I do this (the code is in PHP, but the language doesn't really matter in this case):
$value = ''; // starting value
$repeat = false;
while (true) {
    $value = md5($value);
    /* Save values in the database, one row per value */
    /* Check for a repeated hash value in the DB, and set the $repeat flag to true if there is one */
    if ($repeat) break;
}
As you can see, I suspect that there will eventually be repeated hash values. I think there is no way that every existing text has its own hash value, as that would mean every hash value maps to a single unique input, which doesn't make sense.
My questions are: Is there any article about this "problem" out there? Could I get the same value within one system, for example when I hash files to check whether they are valid? Can this cause problems anywhere, in any system?
If you care about multiple texts hashing to the same value, don't use MD5. MD5 has fast collision attacks, which violate the property you want. Use SHA-2 instead.
When using a secure hash function, collisions for 128-bit hashes are extremely difficult to find, and by that I mean that I know of no case where it has happened. But if you want to avoid even that chance, simply use 256-bit hashes. Then finding a collision by brute force is beyond the computational power of all humanity for now. In particular, there is no known message pair for which SHA-256(m1) == SHA-256(m2) with m1 != m2.
You're right that hashes can't be unique (see the pigeonhole principle), but the chances of actually finding such a case are extremely low, so don't bother handling it.
I typically aim for a 128-bit security level, so when I need a collision-free hash function, I use a 256-bit hash function such as SHA-256.
With your hash chain you won't find a collision unless you're willing to wait a very long time. Collisions become likely once you have around 2^(n/2) values, which is 2^64 in the case of 128-bit hashes such as md5. I know of no brute-force collision against a 128-bit hash; the only collisions I know of are carefully crafted messages that exploit weaknesses in the hashing scheme used (such attacks exist against md5).
Hash it multiple times, with the same method or a different one; then it would be nearly impossible for the value to repeat itself. You can also check whether the values repeat and, if so, re-apply the hash function until they differ, then save the result in the database or use it wherever you like...

Many hash iterations: append salt every time?

I have used unsalted md5/sha1 for a long time, but as this method isn't really secure (and is getting even less secure as time goes by) I decided to switch to a salted sha512. Furthermore, I want to slow the generation of the hash down by using many iterations (e.g. 100).
My question is whether I should append the salt on every iteration or only once at the beginning. Here are the two possible codes:
Append every time:
// some nice big salt
$salt = hash($algorithm, $salt);
// apply $algorithm $runs times for slowdown
while ($runs--) {
    $string = hash($algorithm, $string . $salt, $raw);
}
return $string;
Append once:
// add some nice big salt
$string .= hash($algorithm, $salt);
// apply $algorithm $runs times for slowdown
while ($runs--) {
    $string = hash($algorithm, $string, $raw);
}
return $string;
I first wanted to use the second version (append once) but then found some scripts appending the salt every time.
So, I wonder whether adding it every time adds some strength to the hash. For example, would it be possible that an attacker found some clever way to create a 100timesSha512 function which were way faster than simply executing sha512 100 times?
In short: yes. Go with the first example... The hash function can lose entropy if fed back into itself without re-adding the original data (I can't seem to find a reference right now; I'll keep looking).
And for the record, I am in support of hashing multiple times.
A hash that takes 500 ms to generate is not too slow for your server (considering that hashes are typically not generated on the vast majority of requests). However, a hash that takes that long will significantly increase the time it takes to build a rainbow table...
Yes, it does expose a DoS vulnerability, but it also prevents brute-force attacks (or at least makes them prohibitively slow). There is absolutely a tradeoff, but to some the benefits exceed the risks...
A reference (more like an overview) to the entire process: Key Strengthening
As for the degenerating collisions, the only source I could find so far is this discussion...
And some more discussion on the topic:
HEKS Proposal
SecurityFocus blog on hashing
A paper on Oracle's Password Hashing Algorithms
And a few more links:
PBKDF2 on WikiPedia
PBKDF2 Standard
An email thread that's applicable
Just Hashing Is Far From Enough Blog Post
There are tons of results. If you want more, Google hash stretching... There's tons of good information out there...
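Rather than hand-rolling the loop above, PHP (5.5+) exposes the standard construction linked above directly as hash_pbkdf2(). A minimal sketch (password, salt size and iteration count are illustrative; tune iterations to your hardware):

```php
<?php
// PBKDF2 keys every iteration with the password and XORs the
// intermediate HMAC values together, avoiding the entropy-loss
// concern of naively re-hashing the previous output alone.
$password   = 'correct horse battery staple';
$salt       = openssl_random_pseudo_bytes(16); // per-user random salt
$iterations = 10000;

// 'sha512', 64 hex characters of output
$derived = hash_pbkdf2('sha512', $password, $salt, $iterations, 64);
echo $derived;
```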
In addition to re-hashing multiple times, I would use a different salt for each password/user. Though I think 5000 iterations is a bit too much; try a lower number. There's a trade-off here; you'll have to tune it according to your needs and hardware.
With different salts for each password, an attacker would be forced to bruteforce each password individually instead of constructing a rainbow table, which increases the workload considerably.
As always, here's a recommended read for this: Just hashing is far from enough
EDIT: Iterative hashing is a perfectly valid tactic. There are trade-offs, but everything has them. If you are worried about computation time, why not just store the plaintext password?
Please please please do not roll your own crypto. This is what libraries like OpenSSL are for. Here's few good examples of how you would use it to make salted hashes.
Salted Hashes in OpenSSL
The reason for iterative hashing is to make the process as slow as possible. So you can do even better: use a different salt for each iteration. This can be done by encrypting your original data again and again on each iteration with a fixed key and XORing it with the salt value.
I prefer to go with a double sha1 with two different salts, and to prevent DoS by delaying the answer incrementally (with a simple usleep) after every invalid password check.

Why is MD5'ing a UUID not a good idea?

PHP has a uniqid() function which generates a UUID of sorts.
In the usage examples, it shows the following:
$token = md5(uniqid());
But in the comments, someone says this:
Generating an MD5 from a unique ID is naive and reduces much of the value of unique IDs, as well as providing significant (attackable) stricture on the MD5 domain. That's a deeply broken thing to do. The correct approach is to use the unique ID on its own; it's already geared for non-collision.
Why is this true, if so? If an MD5 hash is (almost) unique for a unique ID, then what is wrong with md5'ing a uniqid?
A UUID is 128 bits wide and has uniqueness inherent to the way it is generated. An MD5 hash is 128 bits wide and doesn't guarantee uniqueness, only a low probability of collision. The MD5 hash is no smaller than the UUID, so it doesn't help with storage.
If you know the hash is derived from a UUID, it is much easier to attack, because the domain of valid UUIDs is actually fairly predictable if you know anything about the machine generating them.
If you need to provide a secure token, you need a cryptographically secure random number generator. UUIDs are not designed to be cryptographically secure, only guaranteed unique. A monotonically increasing sequence bounded by unique machine identifiers (typically a MAC address) and time is still a perfectly valid UUID, but it is highly predictable if you can reverse-engineer a single UUID from the sequence of tokens.
The defining characteristic of a cryptographically secure PRNG is that the result of a given iteration does not contain enough information to infer the value of the next iteration; that is, there is some hidden state in the generator that is not revealed in the output and cannot be inferred by examining a sequence of numbers from the PRNG. If you get into number theory, you can find ways to guess the internal state of some PRNGs from a sequence of generated values. The Mersenne Twister is an example of such a generator: it has hidden state that is used to get its long period, but it is not cryptographically secure; you can take a fairly small sequence of numbers and use it to infer the internal state. Once you've done this, you can use it to attack any cryptographic mechanism that depends on keeping that sequence secret.
Note that uniqid() does not return a UUID, but a "unique" string based on the current time:
$ php -r 'echo uniqid("prefix_", true);'
prefix_4a8aaada61b0f0.86531181
If you do that multiple times, you will get very similar output strings, and anyone familiar with uniqid() will recognize the source algorithm. That makes it pretty easy to predict the next IDs that will be generated.
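For example, two back-to-back calls share nearly their entire prefix (the exact values below are illustrative; only the trailing digits vary within a request):

```php
<?php
// Two uniqid() calls made close together share almost every digit,
// because the value is essentially the current microtime in hex.
$a = uniqid('prefix_', true);
$b = uniqid('prefix_', true);
echo $a, "\n", $b, "\n";
// e.g. prefix_4a8aaada61b0f0.86531181
//      prefix_4a8aaada61b21c.12060102
```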
The advantage of md5()-ing the output, together with an application-specific salt string or random number, is a much harder to guess string:
$ php -r 'echo md5(uniqid("prefix_", true));'
3dbb5221b203888fc0f41f5ef960f51b
Unlike plain uniqid(), this produces very different output every microsecond. Furthermore, it does not reveal your "prefix salt" string, nor that you are using uniqid() under the hood. Without knowing the salt, it is very hard (consider it impossible) to guess the next ID.
In summary, I would disagree with the commenter's opinion and would always prefer the md5()-ed output over plain uniqid().
MD5'ing a UUID is pointless because UUIDs are already unique and of fixed (short) length, properties that are some of the reasons people often use MD5 in the first place. So I suppose it depends on what you plan to do with the UUID, but in general a UUID has the same properties as data that has been MD5'd, so why do both?
UUIDs are already unique, so there is no point in MD5'ing them anyway.
Regarding the security question: in general you can be attacked if the attacker can predict the next unique ID you are about to generate. If it is known that you generate your unique IDs from UUIDs, the set of potential next unique IDs is much smaller, giving a brute-force attack a better chance.
This is especially true if the attacker can obtain a whole batch of unique IDs from you and use them to work out your scheme for generating UUIDs.
Version 3 UUIDs are already MD5-based, so there's no point in doing it again. However, I'm not sure which UUID version PHP uses.
As an aside, MD5 is effectively obsolete and is not to be used for anything worth protecting (PHI, PII or PCI data) from 2010 onwards. The US federal authorities have enforced this, and any non-compliant entity would be paying a lot in penalties.
