I am working on a piece of code that computes the hashes of 2^4 sets of 3 random prime numbers (each less than 2^8). It then keeps selecting sets of 3 composite numbers (each less than 2^8) until it finds a set {c1, c2, c3} whose hash value matches one of the previously computed prime hashes; the matching prime set would be known as {p1, p2, p3}.
From my understanding, the birthday attack is basically finding two functions that provide the same result. So would I create 2 functions, one for the prime numbers and another for the composites? What would be the best way of doing this? I am thinking of PHP as the language.
Any help would be greatly appreciated.
I think the premise is looking for a set of any 3 numbers < 2^8 that produces the same hash value as a set of 3 prime numbers using the same hash function.
Not stated is the range of the hash value.
The birthday attack is based on the fact that since the range of the hash value is limited, a brute force method that tries hashing all combinations of 3 numbers < 2^8 is likely to produce some collisions with valid hash values well before actually trying all possible combinations. However, in this case, trying all combinations of 3 numbers < 2^8 only takes 16777216 loops, so a complete brute force approach can be used.
The program could create a histogram of all the possible hash values. Since there are only 54 primes < 2^8, generating the histogram for all valid inputs (3 primes) would take 54^3 = 157464 loops.
Checking for collisions using all sets of 3 numbers < 2^8 would take 2^24 = 16777216 loops, which shouldn't take too long depending on the hash algorithm.
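The randomized search described in the question can be sketched as below. This is only a sketch under stated assumptions: the thread never says which hash is used, so toyHash() (the first 16 bits of md5) is a hypothetical stand-in, and the helper names isPrime() and randomTriple() are made up for illustration.

```php
<?php
// Hypothetical stand-in hash: the first 16 bits of md5 over "a,b,c".
// The real hash function is not stated in the thread.
function toyHash(array $triple): int {
    return hexdec(substr(md5(implode(',', $triple)), 0, 4));
}

// Trial division is plenty for numbers below 2^8.
function isPrime(int $n): bool {
    if ($n < 2) {
        return false;
    }
    for ($d = 2; $d * $d <= $n; $d++) {
        if ($n % $d === 0) {
            return false;
        }
    }
    return true;
}

// Pick 3 numbers at random (with repetition) from a pool.
function randomTriple(array $pool): array {
    $last = count($pool) - 1;
    return [$pool[mt_rand(0, $last)], $pool[mt_rand(0, $last)], $pool[mt_rand(0, $last)]];
}

// Split 2..255 into primes and composites.
$primes = [];
$composites = [];
for ($n = 2; $n < 256; $n++) {
    if (isPrime($n)) {
        $primes[] = $n;
    } else {
        $composites[] = $n;
    }
}

// Step 1: hash 2^4 = 16 random prime triples, remembering the triple per hash.
$targets = [];
for ($i = 0; $i < 16; $i++) {
    $p = randomTriple($primes);
    $targets[toyHash($p)] = $p;
}

// Step 2: draw composite triples until one hashes to a stored value.
$tries = 0;
do {
    $c = randomTriple($composites);
    $h = toyHash($c);
    $tries++;
} while (!isset($targets[$h]));

printf("After %d tries: {%s} collides with {%s}\n",
    $tries, implode(',', $c), implode(',', $targets[$h]));
```

With a 16-bit hash and up to 16 stored targets, a match should turn up after a few thousand tries on average; a wider hash would need proportionally more.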
I have a question regarding the uniqueness of the md5 function.
I know that md5 hashes (of a microtime value) are not guaranteed to be unique; however, they are pretty unique :)
How can I calculate the probability of a collision between two portions of md5 hashes?
For example, the following PHP generates an 8-character string from the md5 result:
substr(md5(microtime()), 0, 8);
A second scenario: what if the starting index is random (so it gets a different portion of the hash each time)?
substr(md5(microtime()), rand(0, 24), 8);
There are 2^32 combinations of 8 hexadecimal digits. Even if they are completely random, you can only generate on the order of 2^16 (about 65,000) such strings, on average, before you get 2 that are the same; the chance of a duplicate passes 50% at roughly 77,000 strings.
Using md5(), with a random index or not, doesn't significantly change anything as long as all the microtime() values you use are unique. But if you are generating these too fast, or across many machines, then the situation is much, much worse, because there's a good chance you could end up using the same microtime() value twice.
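The estimate for 8 hexadecimal digits can be checked with the usual birthday approximation, k ≈ sqrt(2 N ln(1 / (1 - p))). The following is a minimal sketch; the function name birthdayCount() is made up for illustration, and it assumes the codes are uniformly random.

```php
<?php
// Approximate number of uniformly random values that must be drawn before
// the chance of at least one duplicate reaches $p, for codes of $bits bits.
function birthdayCount(int $bits, float $p = 0.5): float {
    $n = 2.0 ** $bits;                    // size of the code space
    return sqrt(-2 * $n * log(1 - $p));   // k ~ sqrt(2 N ln(1 / (1 - p)))
}

printf("8 hex digits (32 bits): about %.0f values for a 50%% chance\n", birthdayCount(32));
```

The result lands in the tens of thousands, which is why an 8-character code is only safe for fairly small collections.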
Since you are asking about the uniqueness of your string: it is really a matter of probability. The larger the character set and the longer the random string, the lower the chance of generating the same string twice.
So, to get a unique string, you need to store the strings in your DB and compare each new random string against them; if you find a match, generate a fresh string and repeat until you get one that is unique.
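The store-and-retry idea might look like the sketch below. The $seen array is only a stand-in for the database lookup, newUniqueCode() is a made-up name, and random_bytes() (PHP 7+) is swapped in for md5(microtime()) as the source of the 8 characters.

```php
<?php
// Generate an 8-character hex code that has not been handed out before.
// $seen stands in for a database table with a unique index on the code.
function newUniqueCode(array &$seen): string {
    do {
        $code = bin2hex(random_bytes(4));   // 4 random bytes = 8 hex characters
    } while (isset($seen[$code]));          // already used: try again
    $seen[$code] = true;
    return $code;
}

$seen = [];
for ($i = 0; $i < 1000; $i++) {
    newUniqueCode($seen);
}
echo count($seen), " distinct codes generated\n";
```

With a real database, a unique constraint on the column plus a retry on a duplicate-key error avoids the gap between the check and the insert.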
It depends on how many "sub-hashes" you are going to generate and how many bits you're keeping from the original MD5 hash (the length of a "sub-hash"). If you generate just 1 sub-hash and keep just 1 bit, there is no collision at all. If you generate 2 sub-hashes, expect a 50% chance of a collision. Keep 2 bits and the odds drop to 25%. You do the math. Refer to the birthday paradox for more info.
I can't find any information about the collision rate of xxhash64.
I'm going to use it for a cache system (to generate hash keys that need to be unique, about a hundred million of them).
Currently I use md5, but I don't need its cryptographic properties.
So I need some information to decide whether it is a good choice for my task.
Ideally, a comparison of the number of collisions between md5 and xxHash64.
You can calculate it yourself using the birthday problem.
In general, the probability of at least one collision is approximately:
p(k) = 1 - exp(-k(k-1) / (2N))
where k is the number of randomly generated hash values and each value is a non-negative integer less than N, the number of possible hashes:
N = 2^(number of bits), for example 2^128 for md5, or 2^32 for a 32-bit hash.
If you use md5, which produces a 128-bit hash value, applying this formula gives this 'S' graph. The graph shows, for example, that to reach a collision probability of 50% (0.5) you need about 2.2 x 10^19 hashes, roughly 22 quintillion! With fewer than, say, 1 billion hashes, the probability of a collision is negligible.
If you are using hundreds of millions of hashed keys, the probability of a collision with md5 is effectively 0%.
If you use xxhash64, assuming that it produces a 64-bit hash, you will get this graph.
According to this picture, reaching a 50% collision probability takes about 5 billion hashes; in other words, among 5 billion hashes there is a 1-in-2 chance that at least two of them are identical. With around 12 billion hashes, the chance of a collision is about 98%, which is close to certain.
If you are using a hundred million hashed keys, the probability of a collision with xxhash64 is about 0.027%.
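The figures for both hash widths can be reproduced by evaluating the birthday formula directly. This is a minimal sketch: collisionProbability() is a made-up name, it assumes hash outputs behave like uniformly random values, and it uses expm1() so that very small probabilities are not rounded to zero.

```php
<?php
// Approximate probability of at least one collision among $k random
// hashes of $bits bits: p(k) = 1 - exp(-k(k-1) / (2N)), with N = 2^bits.
function collisionProbability(float $k, int $bits): float {
    $n = 2.0 ** $bits;
    return -expm1(-$k * ($k - 1) / (2 * $n));
}

foreach ([64, 128] as $bits) {
    printf("%3d-bit hash, 1e8 keys: p = %.3e\n", $bits, collisionProbability(1e8, $bits));
}
```

Changing the first argument shows how quickly the 64-bit probability climbs once the key count reaches the billions, while the 128-bit value stays negligible.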
This link explains why md5 or fast hash method are not secure.
I'm filling an array with random numbers using $blockhash[$i] = rand().time().rand()
Then, for each random number in that array, I calculate the corresponding HMAC-SHA512:
$SecretKey = "60674ccb549f1988439774adb82ff187e63a2dfd403a0dee852e4e4eab75a0b3";
$sha = hash_hmac('sha512', $value, $SecretKey);
Split it:
$pool = str_split($sha, 2);
Then I get the first number from the $pool array, convert hex to dec and limit it within 1 and 50:
$dec = hexdec($pool[0]) % 50 + 1;
The problem is that the numbers are not that random, and I don't know why. I'm counting the frequency of each number from 1 to 50, but the numbers 1, 2, 3, 4, 5 and 6 come up more often than the others. See graph
Why is this happening and how to fix it? Thanks!
The 2 hex characters you are converting to decimal will be in the range 0-255. Taking that modulo 50 and adding 1 means each of the results 1-6 (remainders 0-5, plus 1) is produced by 6 of the 256 byte values, while every other number is produced by only 5. This accounts for a ~20% increase in how often those numbers come up.
You get 1-6 more often because you fetch two hexadecimal digits from the hash. That's one byte, so it can store values from 0 to 255. Then you use modulo 50. As a result you get the ranges 0-49, 50-99, 100-149, 150-199, 200-249 and... 250-255. This last, incomplete range is responsible for the extra prevalence of 1-6 in your results.
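The bias is easy to confirm by running every possible byte value through the same expression and counting the results. This is a small illustrative sketch; $counts is a made-up variable name.

```php
<?php
// Count how many of the 256 possible byte values map to each result 1..50.
$counts = array_fill(1, 50, 0);
for ($byte = 0; $byte <= 255; $byte++) {
    $counts[$byte % 50 + 1]++;
}

foreach ($counts as $number => $hits) {
    echo $number, ' => ', $hits, "\n";
}
```

The listing makes the uneven split visible at a glance: the lowest results are reachable from one more byte value than the rest.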
Solution: just use mt_rand(1,50);
[edit]
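If you really need to convert a number from the 0-255 range to the 1-50 range, one option is scaling and rounding. Note that this only spreads the unevenness around rather than removing it, because 256 input values cannot be divided evenly among 50 outputs (with this formula, 1 and 50 each receive only 3 byte values).
$number = (int) round($byteValue / (255 / 49)) + 1;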
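If the number has to be derived from the hash bytes, one way to remove the bias entirely is rejection sampling: skip any byte in the incomplete range 250-255 and use the next one. This is a sketch of a swapped-in technique, not something proposed in the thread; pickUnbiased() is a made-up name and the HMAC inputs are placeholders. If the number does not have to come from the hash, PHP 7's random_int(1, 50) is a simpler choice.

```php
<?php
// Return a value in 1..50 from the first byte that is below 250, so that
// every result is backed by exactly 5 byte values. Returns null if every
// byte in the list was rejected.
function pickUnbiased(array $byteValues): ?int {
    foreach ($byteValues as $b) {
        if ($b < 250) {
            return $b % 50 + 1;
        }
    }
    return null;
}

// Placeholder message and key, for illustration only.
$sha   = hash_hmac('sha512', 'example value', 'example key');
$bytes = array_map('hexdec', str_split($sha, 2));   // 64 values in 0..255
$dec   = pickUnbiased($bytes);
```

A SHA-512 digest provides 64 bytes, so running out of acceptable bytes is extremely unlikely, but the null case should still be handled.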
Neither rand() nor mt_rand() generates truly random values.
As the manual states:
This function does not generate cryptographically secure values, and should not be used for cryptographic purposes. If you need a cryptographically secure value, consider using openssl_random_pseudo_bytes() instead.
See Better Random Generating PHP for a Stack Overflow question that points out the same issue and has some good answers.
I don't want my database IDs to be sequential, so I'm trying to generate UIDs with this code:
$bin = openssl_random_pseudo_bytes(12);
$hex = bin2hex($bin);
return base_convert($hex, 16, 36);
My question is: how many bytes would I need to make the IDs unique enough to handle large numbers of records (like Twitter)?
Use PHP's uniqid() with its more_entropy argument set to true. That'll give you plenty of room.
You might consider something like the way tinyurl and other shortening services work. I've used similar techniques, which guarantee uniqueness until all combinations are exhausted. Basically, you choose an alphabet and a code length. Let's say we use alphanumeric characters, upper and lower case, so that's 62 characters in the alphabet, and 5 characters per code. That's 62^5 = 916,132,832 combinations.
You start with your sequential database ID and multiply it by some prime number (choose one that's fairly large, like 2097593). Take the product modulo 62^5 so that it wraps around if it exceeds that, and then convert the result to base 62 using your chosen alphabet.
This makes the codes look unrelated to each other, yet because the multiplier shares no factor with 62^5 (any prime other than 2 or 31 works), we're guaranteed not to hit the same code twice until we've used all of them. And it's very short.
You can use longer keys with a smaller alphabet, too, if length isn't a concern.
Here's a question I asked along the same lines: Tinyurl-style unique code: potential algorithm to prevent collisions
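The multiply-and-wrap scheme could be sketched as follows. The function name encodeId() and the ordering of the alphabet are assumptions made for illustration; the multiplier and code length are the ones from the answer. It assumes 64-bit PHP integers and IDs below 62^5, so the multiplication cannot overflow.

```php
<?php
// Turn a sequential ID into a 5-character base-62 code.
// Assumes 0 <= $id < 62^5 and 64-bit integers.
function encodeId(int $id): string {
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $space = 62 ** 5;                    // 916,132,832 possible codes
    $n = ($id * 2097593) % $space;       // multiply, then wrap around

    // Convert to base 62, left-padded to 5 characters.
    $code = '';
    for ($i = 0; $i < 5; $i++) {
        $code = $alphabet[$n % 62] . $code;
        $n = intdiv($n, 62);
    }
    return $code;
}

foreach ([1, 2, 3] as $id) {
    echo $id, ' => ', encodeId($id), "\n";
}
```

Because the mapping is a fixed multiplication, anyone who learns the multiplier can reverse it, so this hides the sequence from casual view rather than securing it.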
Assuming that openssl_random_pseudo_bytes may generate every possible value, N bytes will give you 2^(N * 8) distinct values. For 12 bytes this is 2^96, or about 7.923 * 10^28.
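To turn that into an answer to "how many bytes", the birthday approximation p ≈ k^2 / (2N) can be solved for the ID size. This is a rough sketch: bytesNeeded() is a made-up name, the approximation only holds for small p, and the record counts and acceptable probabilities are whatever you choose to plug in.

```php
<?php
// Smallest whole number of random bytes such that $records IDs collide
// with probability at most $maxProbability, using p ~ k^2 / (2N),
// which gives N >= k^2 / (2p) and bits = log2(N).
function bytesNeeded(float $records, float $maxProbability): int {
    $bits = log($records ** 2 / (2 * $maxProbability), 2);
    return (int) ceil($bits / 8);
}

printf("1e12 records, p <= 1e-9: %d bytes\n", bytesNeeded(1e12, 1e-9));
```

Separately, the PHP manual warns that base_convert() may lose precision on large numbers, so a 24-digit hex string may be safer to store as the output of bin2hex() directly.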
Use MySQL's UUID() function:
insert into `database`(`unique`,`data`) values(UUID(),'Test');
If you're not using MySQL, search for "UUID" plus the name of your database and you will find an equivalent.
Source: Wikipedia
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%
I write this simple line to get random & unique code each time (just 8 characters):
echo substr(md5(uniqid(rand(), true)),0,8);
Output:
077331e5
5af425b1
0fc7dcf2
...
My question is whether I will never get a collision (duplicate), or whether that can happen.
PS:
Would it be better to use time()?
echo substr(md5(uniqid(time(), true)),0,8);
Hashes can have collisions. By taking a substring of the hash you are just upping the chance of that happening.
Regardless of what you feed into md5(), by taking the substring you're discarding most of md5's output and shrinking the range of possible values. md5 outputs a 128-bit value, and you're limiting it to 32 bits, so the chance that two given codes collide goes from about 1 in 3.4x10^38 to about 1 in 4.3 billion.
Your "unique code" is a string of eight hexadecimal digits, and thus it has 4294967296 possible values. You are thus guaranteed to get a duplicate of an earlier code by the 4294967297th time you run it.
PHP has a method to provide unique Ids called uniqid()
You stand a fair chance of your 8-character MD5 being unique, but as with any random string, the shorter you make it, the more likely you are to have a collision.
Short answer: it can happen. There's a discussion here about the collision space of MD5 that you might want to check out. Taking a substring of the MD5 makes collisions much, much more likely.
A better solution may be the answer proposed here, possibly checking it against other unique IDs that you've generated.
Your code returns part of a hash. Hashes are for hashing, so you cannot guarantee any particular property of the results (e.g. uniqueness).
Also, you are keeping only part of a hash, and each character of a hash is a hexadecimal digit (0 to 9 or a to f, 16 possibilities). A simple calculation:
16 ^ 8 = 4 294 967 296
shows how many unique values your code can generate. This number (4 294 967 296) means that if you call this function more than 4 294 967 296 times, the generated values are certain not to be all unique. In practice, duplicates are likely to appear after far fewer calls.