I read the Wikipedia article about md5 hashes but I still can't understand how a hash can't be "reconstituted" back to the original text.
Could someone explain to someone who knows very little about cryptography how this works? What part of the function makes it one-way?
Since everyone until now has simply defined what a hash function was, I will bite.
A one-way function is not just a hash function -- a function that loses information -- but a function f for which, given an image y ("SE" or 294 in existing answers), it is difficult to find a pre-image x such that f(x)=y.
This is why they are called one-way: you can compute an image but you can't find a pre-image for a given image.
None of the ordinary hash functions proposed so far in existing answers have this property. None of them are one-way cryptographic hash functions. For instance, given "SE", you can easily pick the input "SXXXE", an input with the property that X-encode("SXXXE")=SE.
There are no "simple" one-way functions. They have to mix their inputs so well that not only you don't recognize the input at all in the output, but you don't recognize another input either.
SHA-1 and MD5 used to be popular one-way functions but they are both nearly broken (specialists know how to create pre-images for given images, or are nearly able to do so). There is a contest underway to choose a new standard one, which will be named SHA-3.
An obvious approach to invert a one-way function would be to compute many images and keep them in a table associating to each image the pre-image that produced it. To make this impossible in practice, all one-way functions have a large output, at least 64 bits but possibly much larger (up to, say, 512 bits).
EDIT: How do most cryptographic hash functions work?
Usually they have at their core a single function that does complicated transformations on a block of bits (a block cipher). The function should be nearly bijective (it shouldn't map too many sequences to the same image, because that would cause weaknesses later) but it doesn't have to be exactly bijective. And this function is iterated a fixed number of times, enough to make the input (or any possible input) impossible to recognize.
Take the example of Skein, one of the strong candidates for the SHA-3 contest. Its core function is iterated 72 times. The only number of iterations for which the creators of the function know how to sometimes relate the outputs to some inputs is 25. They say it has a "safety factor" of 2.9.
Think of a really basic hash - for the input string, return the sum of the ASCII values of each character.
hash( 'abc' ) = ascii('a')+ascii('b')+ascii('c')
= 97 + 98 + 99
= 294
Now, given the hash value of 294, can you tell what the original string was? Obviously not, because 'abc' and 'cba' (and countless others) give the same hash value.
Cryptographic hash functions work the same way, except that obviously the algorithm is much more complex. There are always going to be collisions, but if you know string s hashes to h, then it should be very difficult ("computationally infeasible") to construct another string that also hashes to h.
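The toy hash above is easy to try out. Here is a minimal sketch in Python (the name ascii_sum_hash is made up for this illustration, it is not a library function) showing that several different strings land on the same value:

```python
def ascii_sum_hash(text: str) -> int:
    """Toy hash: sum of the character codes. Not a cryptographic hash."""
    return sum(ord(ch) for ch in text)


print(ascii_sum_hash("abc"))  # 97 + 98 + 99 = 294
print(ascii_sum_hash("cba"))  # same characters in another order, same sum
print(ascii_sum_hash("bbb"))  # 98 * 3, a different string with the same sum
```

Given only 294, nothing in the number says which of these strings (or which of the many others) was the input.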
Shooting for a simple analogy here instead of a complex explanation.
To start with, let's break the subject down into two parts, one-way operations and hashing. What is a one-way operation and why would you want one?
One-way operations are called that because they are not reversible. Most typical operations like addition and multiplication can be reversed, while modulo division cannot be reversed. Why is that important? Because you want to provide an output value which 1) is difficult to duplicate without the original inputs and 2) provides no way to figure out the inputs from the output.
Reversible
Addition:
4 + 3 = 7
This can be reversed by taking the sum and subtracting one of the addends
7 - 3 = 4
Multiplication:
4 * 5 = 20
This can be reversed by taking the product and dividing by one of the factors
20 / 4 = 5
Not Reversible
Modulo division:
22 % 7 = 1
This cannot be reversed because there is no operation that you can do to the remainder and the divisor to reconstitute the dividend (or vice versa).
Can you find an operation to fill in where the '?' is?
1 ? 7 = 22
1 ? 22 = 7
With that being said, one-way hash functions have the same mathematical quality as modulo division.
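To make the modulo example concrete, here is a small Python sketch (the helper name dividends_with_remainder is hypothetical) that lists every dividend below a limit that leaves the same remainder. Knowing the remainder and the divisor still leaves a whole family of candidates:

```python
def dividends_with_remainder(remainder: int, divisor: int, limit: int) -> list:
    """All non-negative dividends below `limit` with `n % divisor == remainder`."""
    return [n for n in range(limit) if n % divisor == remainder]


# 22 % 7 = 1, but so do many other numbers:
print(dividends_with_remainder(1, 7, 30))  # [1, 8, 15, 22, 29]
```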
Why is this important?
Let's say I gave you a key to a locker in a bus terminal that has one thousand lockers and asked you to deliver it to my banker. Being the smart guy you are, not to mention suspicious, you would immediately look at the key to see what locker number is written on it. Knowing this, I've done a few devious things: first, I found two numbers that, when divided using modulo division, give me a number in the range between 1 and 1000; second, I erased the original number and wrote on the key the divisor from the pair of numbers; third, I chose a bus terminal that has a guard protecting the lockers from miscreants by only letting people try one locker a day with their key; fourth, the banker already knows the dividend, so when he gets the key he can do the math, figure out the remainder and know which locker to open.
If I choose the operands wisely I can get near to a one-to-one relationship between the remainder and the dividend, which forces you to try each locker because the answer spreads the results of the possible inputs over the range of desired numbers, the lockers available in the terminal. Basically, it means you can't acquire any knowledge about the remainder even if you know one of the operands.
So, now I can 'trust' you to deliver the key to its rightful owner without worrying that you can easily guess to which locker it belongs. Sure, you could brute force search all the lockers but that would take almost 3 years, plenty of time for my banker to use the key and empty the locker.
See the other answers for more specifics on the different hash functions.
Here's a very simple example. Assume that I'm a beginning cryptographer and I create a hash function that does the following:
int SimpleHash(File file) {
    // 0 if the file length is even, 1 if it is odd
    return file.length % 2;
}
Now here's the test. SimpleHash(specialFile) is 0. What was my original file?
Obviously, there's no way to know (although you could likely discover pretty easily that my hash is based on file length). There is no way to "reconstitute" my file based on the hash because the hash doesn't contain everything that my file did.
In simple terms, a hash function works by making a big tangled mess of the input data.
See MD5 for instance. It processes input data by 512-bit blocks. Each block is split into 16 32-bit words. There are 64 steps, each step using one of the 16 input words. So each word is used four times within the course of the algorithm. This is where one-wayness comes from: any input bit is input at several places, and between two such inputs the function mixes all the current data together so that each input bit impacts most of the 128-bit running state. This prevents you from inverting the function, or computing a collision, by looking at only a part of the data. You have to look at the whole 128 bits, and the space of 128-bit blocks is too wide to be efficiently walked through.
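The claim that each input bit affects most of the 128-bit state can be observed from the outside. The following Python sketch uses the standard hashlib module (assuming MD5 is enabled in your Python build; the helper names are mine) to flip a single input bit and count how many of the 128 output bits change:

```python
import hashlib


def md5_as_int(data: bytes) -> int:
    """MD5 digest of `data` as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Number of output bits that differ between the MD5 digests of a and b."""
    return bin(md5_as_int(a) ^ md5_as_int(b)).count("1")


original = b"hello world"
flipped = bytes([original[0] ^ 0x01]) + original[1:]  # flip one input bit

print(differing_bits(original, flipped), "of 128 output bits changed")
```

For a well-mixed function the count should hover around 64, i.e. about half of the output bits.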
Now MD5 does not do a good job at it, since collisions for that function can be found. From a cryptographer's point of view, MD5 is a rotated encryption function. The processing of one message block M (512 bits) uses an input state V (a 128-bit value) and computes the new state V' as V' = V + E(M, V) where '+' is a word-wise addition, and 'E' happens to be a symmetric encryption function (aka a 'block cipher') which uses M as key and V as the message to be encrypted. From a closer look, E is a kind of "extended Feistel network", similar to the DES block cipher, with four quarters instead of two halves. Details are not important here; my point is that what makes a "good" hash function, among hash functions which use that structure (called "Merkle-Damgård"), is similar to what makes a block cipher "secure". The successful collision attacks on MD5 use differential cryptanalysis, a tool which was designed to attack block ciphers in the first place.
From a good block cipher to a good hash function, there is a step which is not to be dismissed. With the Merkle-Damgård structure, the hash function is secure if the underlying block cipher is resistant to "related key attacks", a rather obscure property against which block ciphers are rarely strengthened because, for symmetric encryption, related key attacks barely have any practical impact. For instance, the AES encryption turned out not to be as resistant to related key attacks as could be wished for, and this did not trigger general panic. That resistance was not part of the properties which were sought for when AES was designed. It just prevents turning the AES into a hash function. There is a hash function called Whirlpool, which builds on a derivative of Rijndael, "Rijndael" being the initial name of what became the AES; but Whirlpool takes care to modify the parts of Rijndael which are weak to related key attacks.
Also, there are other structures which can be used for building a hash function. The current standard functions (MD5, SHA-1, and the "SHA-2" family, aka SHA-224, SHA-256, SHA-384 and SHA-512) are Merkle-Damgård functions, but many of the would-be successors are not. There is an ongoing competition, organized by the NIST (the US federal organization which deals with that kind of things), to select a new standard hash function, dubbed "SHA-3". See this page for details. Right now, they are down to 14 candidates from an initial 51 (not counting a dozen extra which failed the administrative test of sending a complete submission with code which compiles and runs properly).
Let's now have a more conceptual look. A secure hash function should look like a random oracle: an oracle is a black box which, when given a message M as input, outputs an answer h(M) which is chosen at random, uniformly, in the output space (i.e. all n-bit strings if the hash function output length is n). If given the same message M again as input, the oracle outputs the same value as before. Apart from that restriction, the output of the oracle on a not previously used input M is unpredictable. One can imagine the oracle as a container for a gnome who throws dice, and carefully records the input messages and corresponding outputs in a big book, so that he will honor his oracle contract. There is no way to predict what the next output will be since the gnome himself does not know that.
If a random oracle exists, then inverting the hash function has cost 2^n: in order to have a given output, there is no better strategy than using distinct input messages until one yields the expected value. Due to the uniform random selection, probability of success is 1/(2^n) at each try, and the average number of requests to the dice-throwing gnome will be 2^n. For collisions (finding two distinct inputs which yield the same hash value), the cost is about 1.4*2^(n/2) (roughly speaking, with 1.4*2^(n/2) outputs, we can assemble about 2^n pairs of outputs, each having a probability of 1/(2^n) of matching, i.e. having two distinct inputs which have the same output). These are the best that can be done with a random oracle.
Therefore, we look for hash functions which are as good as a random oracle: they must mix the input data in such a way that we cannot find a collision more efficiently than what it would cost to simply invoke the function 2^(n/2) times. The bane of hash functions is mathematical structure, i.e. shortcuts which allow the attacker to view the hash function internal state (which is big, at least n bits) as a variation on a mathematical object which lives in a much shorter space. 30 years of public research on symmetric encryption systems have produced a whole paraphernalia of notions and tools (diffusion, avalanche, differentials, linearity...) which can be applied. The bottom line, however, is that we have no proof that a random oracle may actually exist. We want a hash function which cannot be attacked. What we have are hash function candidates, for which no attack is currently known, and, somewhat better, we have some functions for which some kinds of attack can be proven not to work.
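The collision cost can be checked experimentally on a deliberately small output. The sketch below is only an illustration under assumptions of mine: it truncates SHA-256 to n bits as a stand-in for an n-bit random oracle, and the function names are made up. It hashes the inputs "0", "1", "2", ... until two of them collide:

```python
import hashlib


def truncated_hash(data: bytes, n_bits: int) -> int:
    """First n_bits of SHA-256, standing in for an n-bit random oracle."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - n_bits)


def find_collision(n_bits: int):
    """Hash "0", "1", "2", ... until two inputs share an output.

    Returns the two colliding inputs and the number of hash calls made.
    """
    seen = {}  # output value -> first input that produced it
    counter = 0
    while True:
        message = str(counter).encode()
        h = truncated_hash(message, n_bits)
        if h in seen:
            return seen[h], message, counter + 1
        seen[h] = message
        counter += 1


first, second, tries = find_collision(24)
print(first, second, tries)  # tries is typically a few thousand, near 2**12
```

With n = 24 the search ends after roughly 2^12 calls rather than 2^24, which is the birthday effect described above. Finding a pre-image for one fixed output would take on the order of 2^24 calls instead.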
There is still some research to be done.
A hash is a (very) lossy encoding.
To give you a simpler example, imagine a fictitious 2-letter encoding of a 5-letter word called the X-encoding. The algorithm for the X-encoding is simple: take the first and last letters of the word.
So,
X-encode( SAUCE ) = SE
X-encode( BLOCK ) = BK
Clearly, you cannot reconstruct SAUCE from its encoding SE (assuming our range of possible inputs is all 5-letter words). The word could just as easily be SPACE.
As an aside, the fact that SAUCE and SPACE both produce SE as an encoding is called a collision, and you can see that the X-encoding wouldn't make a very good hash. :)
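For anyone who wants to play with it, here is the X-encoding as a few lines of Python (a sketch; x_encode is just a name for this made-up encoding):

```python
def x_encode(word: str) -> str:
    """Toy 'X-encoding': keep only the first and last letters of a 5-letter word."""
    if len(word) != 5:
        raise ValueError("the X-encoding is only defined for 5-letter words")
    return word[0] + word[-1]


print(x_encode("SAUCE"))  # SE
print(x_encode("SPACE"))  # SE, a collision with SAUCE
print(x_encode("BLOCK"))  # BK
```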
array
With some squinting, associative arrays look very much like hashes. The major differences were the lack of the % symbol on hash names, and that one could only assign to them one key at a time. Thus, one would say $foo{'key'} = 1;, but only @keys = keys(foo);. Familiar functions like each, keys, and values worked as they do now (and delete was added in Perl 2).
Perl 3 had three whole data types: it had the % symbol on hash names, allowed an entire hash to be assigned to at once, and added dbmopen (now deprecated in favour of tie). Perl 4 used comma-separated hash keys to emulate multidimensional arrays (which are now better handled with array references).
Perl 5 took the giant leap of referring to associative arrays as hashes. (As far as I know, it is the first language to have referred to the data structure thus, rather than "hash table" or something similar.) Somewhat ironically, it also moved the relevant code from hash.c into hv.c.
Nomenclature
Dictionaries, as explained earlier, are unordered collections of values indexed by unique keys. They are sometimes called associative arrays or maps. They can be implemented in several ways, one of which is by using a data structure known as a hash table (and this is what Perl refers to as a hash).
Perl's use of the term "hash" is the source of some potential confusion, because the output of a hashing function is also sometimes called a hash (especially in cryptographic contexts), and because hash tables aren't usually called hashes anywhere else.
To be on the safe side, refer to the data structure as a hash table, and use the term "hash" only in obvious, Perl-specific contexts.
I am using PHP's crc32 function to generate a numeric equivalent of a MongoId, because I use this numeric id in MySQL searches (string search is slow).
I came across a case where crc32 gives the same numeric value for two different MongoIds.
Any help or suggestion would be greatly appreciated.
Thanks
Gaurav
@MarkAdler's answer explains why you are getting a hash collision. But if I were in your shoes, I'd be more interested in what I could do about it.
What you could do, of course, is use a different hashing algorithm that produces longer hashes (less chance of collision), but is still acceptably fast. You'll find a highly-rated review of several alternatives in this question from programmers.stackexchange.com. They all have collisions (coincidentally CRC32 did pretty well in that answer's test sets), but you could try some of them on mongoids and see what happens.
I also found this clever suggestion: To generate a 64-bit hash, you could take two different 32-bit hash algorithms and concatenate the hashes (of course this will more or less halve the speed of your hashing).
The more robust solution would be to write your code with the understanding that a hash is a bucket, and you will sometimes get multiple results (or the wrong result) from a crc32 query. Simply add a second step to check the unhashed Id(s) of the returned records. Since there's only ever going to be a handful of hits, it won't take long at all.
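The "hash is a bucket" idea can be sketched in a few lines. This is Python rather than PHP/MySQL, and the class name and sample ids are made up for illustration. The CRC-32 narrows the search to a bucket, and the full id is compared afterwards to pick the right record:

```python
import zlib
from collections import defaultdict


class Crc32Index:
    """Index records by CRC-32 of their id, verifying the full id on lookup."""

    def __init__(self):
        self._buckets = defaultdict(list)  # crc32 value -> [(id, record), ...]

    def add(self, record_id: str, record) -> None:
        self._buckets[zlib.crc32(record_id.encode())].append((record_id, record))

    def get(self, record_id: str):
        # Step 1: cheap integer lookup. Step 2: compare the unhashed ids.
        bucket = self._buckets.get(zlib.crc32(record_id.encode()), [])
        for stored_id, record in bucket:
            if stored_id == record_id:
                return record
        return None


index = Crc32Index()
index.add("507f1f77bcf86cd799439011", "first record")
index.add("507f191e810c19729de860ea", "second record")
print(index.get("507f1f77bcf86cd799439011"))  # first record
```

In SQL terms this corresponds to filtering on the integer column and on the original string column in the same query, so a collision costs one extra comparison instead of a wrong result.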
Unless your strings are four bytes or less, then it is inevitable that many strings will have any given CRC-32 value. If you have even just one more than 2^32 possible strings, then it is absolutely guaranteed that at least two of those strings will map to the same CRC-32.
There is no help or suggestion. You cannot expect there to be no collisions, unless there are fewer possible strings than possible CRCs.
By the way, you can construct such cases intentionally with my spoof code, which lets you give it a set of bits you would allow to be changed in a string, and it will tell you which of those bits to flip in order to get a desired CRC.
I am writing a raffle program where people have some tickets, which are marked by natural numbers in the range of 1 to 100 inclusive.
I use mt_rand(1,100) to generate the number of the winning ticket, and then this is outputted to the site, so everyone can see it.
Now I did a little research and found out from the Mersenne Twister wiki article that:
Observing a sufficient number of iterations (624 in the case of MT19937, since this is the size of the state vector from which future iterations are produced) allows one to predict all future iterations.
Is the current version used by mt_rand() MT19937?
If so, what can I do to make my generated numbers more cryptographically secure?
Thanks in advance :-)
The short answer:
If so, what can I do to make my generated numbers more cryptographically secure?
You can simply use a random number generator suited for this task instead of mt_rand().
When PHP 7 comes out, you can use random_int() in your projects when a cryptographically secure random number generator is needed.
"Okay, great, but PHP 7 isn't out yet. What do I do today?"
Well, you're in luck, you have two good options available to you.
Use RandomLib. OR
I've been working on backporting PHP 7's CSPRNG functions into PHP 5 projects. It lives on Github under paragonie/random_compat.
"I don't want to use a library; how do I safely roll my own?"
When it comes to cryptography, rolling your own implementation is usually a poor decision. "Not invented here," is usually a good thing. However, if you're dead set on writing your own PHP library to securely generate random integers or strings, there are a few things to keep in mind:
Use a reliable source of randomness. In order of preference, reading from /dev/urandom should be your first choice, followed by mcrypt_create_iv() with MCRYPT_DEV_URANDOM, followed by reading from CAPICOM (Windows only), and lastly openssl_random_pseudo_bytes().
When reading from /dev/urandom, cache your file descriptors to reduce the overhead of each function invocation.
When reading from /dev/urandom, PHP will always buffer 8192 bytes of data (which, likely, you will not use). Be sure to turn read buffering off (i.e. stream_set_read_buffer($fileHandle, 0);).
Avoid any functions or operations that can leak timing information. This means, generally, you want to use bitwise operators instead of math functions (e.g. log()) or anything involving floats.
Don't use the modulo operator to reduce a random integer to a range. This will result in a biased probability distribution whenever the number of possible raw values is not a multiple of the range size.
A good CSPRNG will not fall back to insecure results. Don't silently just use mt_rand() if no suitable CSPRNG is available; instead, throw an uncaught exception or issue a fatal error. Get the developer's attention immediately.
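To illustrate the points about avoiding modulo and not falling back, here is a sketch of rejection sampling on top of the operating system's CSPRNG. It is written in Python rather than PHP, secure_randint is a name I made up, and it is meant as an illustration of the technique, not as a replacement for a reviewed library (Python itself already ships the secrets module for this):

```python
import os


def secure_randint(low: int, high: int) -> int:
    """Uniform integer in [low, high] from the OS CSPRNG, using rejection sampling."""
    if low > high:
        raise ValueError("low must not exceed high")
    span = high - low + 1
    if span == 1:
        return low
    n_bits = (span - 1).bit_length()  # smallest bit count that can hold span - 1
    n_bytes = (n_bits + 7) // 8
    mask = (1 << n_bits) - 1
    while True:
        # os.urandom raises an error if no OS randomness source exists;
        # there is deliberately no fallback to a weaker generator.
        candidate = int.from_bytes(os.urandom(n_bytes), "big") & mask
        if candidate < span:  # reject out-of-range values instead of using modulo
            return low + candidate


print(secure_randint(1, 100))  # e.g. a winning ticket number
```

Because out-of-range candidates are thrown away rather than wrapped around, every value in the range is equally likely. The mask keeps the rejection rate below one half.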
Sorry, but Mersenne Twister was not designed to meet cryptographic requirements. No, you cannot and should not try to fix it, because usually when non-experts try to improve cryptographic functionality, they just end up making things worse.
PHP has a long history of problems with its randomness for cryptographic purposes. I'll point out a few references for light reading:
I forgot your password: Randomness attacks against PHP applications
Cracking PHP's lcg_value()
phpwn: Attack on PHP sessions and random numbers
To my knowledge, the best option for secure (pseudo) random number generation in PHP applications is to use openssl_random_pseudo_bytes.
mt_rand by its very name is the Mersenne Twister, a non secure random number generator. Furthermore it is often just seeded with a specific time in ms, something that an attacker can simply guess or aim for.
You cannot make the Mersenne Twister secure. So if anywhere possible you should use a secure random number generator seeded by an entropy source. This entropy source is usually obtained from the operating system. An OpenSSL based one should be preferred.
There is absolutely no reason why you would be stuck with MT. PRNGs are just algorithms. There are plenty of libraries that contain secure PRNGs.
Most applications, especially databases, can sort and filter by small integers or floats much faster than they can do string comparisons.
Therefore I'm wondering if there is a hashing function that I can use to return a 32bit or 64bit number of a short string (about 5 - 40 characters) so that I can compare by integer instead of by string.
I first thought of crc32, but it seems it's much too small of a number and would result in possible collisions in less than 50,000 hashes (I need to do over a million).
I'm mostly interested in working in Python, PHP, V8 Javascript, PostgreSQL, and MySQL.
The problem that collisions become likely at 50k entries is inherent in all 32 bit hashes. If you read a bit on the Birthday problem you'll see that collisions become likely if you have around sqrt(HashSpace) elements, e.g. sqrt(2^32) = 64k for 32 bit hashes.
With 64 bit hashes collisions become much rarer. But I still don't feel too comfortable betting the correctness of my program on that.
Using an approximation from wikipedia:
We obtain a probability of 3*10^-8 for 1 million elements, and 3*10^-6 for 10 million elements.
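These figures can be reproduced with the usual birthday-problem approximation p ≈ 1 - exp(-n^2 / (2H)), where n is the number of elements and H the size of the hash space. I am assuming this is the approximation meant above; the function name is mine. A short Python sketch:

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-n^2 / (2 * 2^hash_bits))."""
    space = 2 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-(n_items * n_items) / (2 * space))


print(collision_probability(1_000_000, 64))   # about 2.7e-08
print(collision_probability(10_000_000, 64))  # about 2.7e-06
print(collision_probability(50_000, 32))      # about 0.25 for a 32-bit hash
```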
You could use CRC64 for that. Or just truncate a crypto hash, such as md5 or sha1 to the desired length.
If a malicious person can choose the strings, breaking your program by deliberately creating collisions, you should at least switch to a keyed hash, such as HMAC.
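Truncating a crypto hash, and the keyed variant, could look like the following in Python. This is a sketch: the function names are made up, and HMAC-SHA256 is just one reasonable choice of keyed hash.

```python
import hashlib
import hmac


def hash64(text: str) -> int:
    """First 8 bytes of the MD5 digest as an unsigned 64-bit integer."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def keyed_hash64(text: str, key: bytes) -> int:
    """First 8 bytes of HMAC-SHA256 under a secret key, as an unsigned 64-bit integer."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


print(hash64("some short string"))
print(keyed_hash64("some short string", b"keep this key secret"))
```

The results are unsigned, so a signed 64-bit database column may need the value shifted or reinterpreted before storing. Without the key, an attacker cannot compute keyed_hash64 offline to search for colliding strings.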
Depending on what you're doing, you could also simply create an in-memory mapping between string and int where you simply increment a counter for each element you encounter. This gives you a perfect mapping without risk for collisions, but is only applicable in some scenarios.
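The counter-based mapping is only a few lines. A sketch in Python (StringIds is a hypothetical name; in a database the same role is played by a lookup table with an auto-increment key):

```python
class StringIds:
    """Assign each distinct string the next integer; no collisions are possible."""

    def __init__(self):
        self._ids = {}

    def id_for(self, text: str) -> int:
        # setdefault stores the new id only if the string has not been seen yet.
        return self._ids.setdefault(text, len(self._ids) + 1)


ids = StringIds()
print(ids.id_for("apple"))   # 1
print(ids.id_for("banana"))  # 2
print(ids.id_for("apple"))   # 1 again
```

The catch is that the mapping has to be kept around and shared by everything that needs the ids, which is why it only fits some scenarios.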
What's the definition of bias in:
The distribution of mt_rand() return values is biased towards even numbers on 64-bit builds of PHP when max is beyond 2^32.
If it's the kind of bias stated in alternate tie-breaking rules for rounding, I don't think it really matters (since the bias is not really visible).
Besides mt_rand() is claimed to be four times faster than rand(), just by adding three chars in front!
Assuming mt_rand is available, what's the disadvantage of using it?
mt_rand uses the Mersenne Twister algorithm, which is far better than the LCG typically used by rand. For example, the period of an LCG is a measly 2^32, whereas the period of mt_rand is 2^19937 − 1. Also, all the values generated by an LCG will lie on lines or planes when plotted into a multidimensional space. Also, it is not only practically feasible, but relatively easy to determine the parameters of an LCG. The only advantage LCGs have is being potentially slightly faster, but on a scale that is completely irrelevant when coding in PHP.
However, mt_rand is not suitable for cryptographic purposes (generation of tokens, passwords or cryptographic keys) either.
If you need cryptographic randomness, use random_int in PHP 7. On older PHP versions, read from /dev/urandom or /dev/random on a POSIX-conforming operating system.
The distribution quirk that you quoted is only relevant when the random number range you're generating is larger than 2^32. That is 4294967296.
If you're working with numbers that big, and you need them to be randomised, then perhaps this is a reason to reconsider using mt_rand(). However, if you're working with numbers smaller than this, then it is irrelevant.
The reason it happens is due to the precision of the random number generator not being good enough in those high ranges.
I've never worked with random numbers that large, so I've never needed to worry about it.
The difference between rand() and mt_rand() is a lot more than "just three extra characters". They are entirely different function calls, and work in completely different ways. Just the same as you don't expect print() and print_r() to be similar.
mt_rand() gets its name from the "Mersenne Twister" algorithm it uses to generate the random numbers. This algorithm is known to be a quick, efficient and high quality random number generator, which is why it is available in PHP.
The older rand() function makes use of the operating system's random number generator by making a system call. This means that it uses whatever random number generator happens to be the default on the operating system you're using. In general, the default random number generator uses a much slower and older algorithm, hence the claim that mt_rand() is quicker, but it will vary from system to system.
Therefore, for virtually all uses, mt_rand() is a better function to use than rand().
You say "assuming mt_rand() is available", but it always will be since it was introduced way back in PHP4.