Where can I find xxhash64 and md5 collision probability statistics? - php

I can't find any information about the collision rate of xxhash64.
I'm going to use it for a cache system, to generate hash keys that need to be unique (about a hundred million of them).
Right now I use md5, but I don't need its cryptographic properties.
So I need some information to decide whether xxhash64 is a good choice for my task.
Ideally, a comparison of the number of collisions between md5 and xxHash64.

You can calculate it yourself using the birthday problem.
In general, the probability of at least one collision among k hashes is approximately:
p(k) = 1 - exp(-k(k-1)/(2N)),
where k is the number of randomly generated hashes and each hash is a non-negative integer less than N (the number of possible hashes).
N = 2^(number of bits): for example 2^128 for md5, or 2^32 for a 32-bit hash.
If you use md5,
which produces a 128-bit hash value, plotting this formula gives an S-shaped curve. It shows, for example, that to reach a collision probability of 50% (0.5) you need about 22 quintillion hashes (roughly 2.2 x 10^19). With far fewer, for instance 1 billion hashes, the probability of a collision is negligible.
With a hundred million hashed keys, the probability of a collision using md5 is effectively 0% (on the order of 10^-23).
If you use xxhash64,
assuming that xxhash64 produces a 64-bit hash, you get a curve of the same shape at a much smaller scale.
It shows that for a 50% collision probability you need about 5 billion hashes. In other words, with about 5 billion hashes there is a 1 in 2 chance that at least two of them are equal. With around 12 billion hashes a collision is almost certain (about 98%).
With a hundred million hashed keys, the probability of a collision using xxhash64 is about 0.027%.
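If you want to check these numbers yourself, here is a minimal sketch of the formula in PHP. The function name collisionProbability() is made up for this example, and it assumes the hash output is uniformly distributed, which is only an approximation for a real hash function.

```php
<?php
// Birthday-problem approximation: p(k) = 1 - exp(-k(k-1) / (2N)), with N = 2^bits.
// Hypothetical helper; assumes uniformly distributed hash values.
function collisionProbability(float $k, int $bits): float
{
    $n = pow(2.0, $bits);               // number of possible hash values
    $x = $k * ($k - 1.0) / (2.0 * $n);
    return -expm1(-$x);                 // same as 1 - exp(-x), but accurate for tiny x
}

printf("md5, 1e8 keys:      %.3e\n", collisionProbability(1e8, 128));
printf("xxhash64, 1e8 keys: %.3e\n", collisionProbability(1e8, 64));
printf("xxhash64, 5e9 keys: %.3f\n", collisionProbability(5e9, 64));
```

expm1() is used instead of 1 - exp(-x) because for md5 the exponent is so small that the plain subtraction would round to exactly 0.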
This link explains why md5 or fast hash method are not secure.

Related

Does hashing a random value plus an auto increment number ensure uniqueness?

I'm trying to generate a unique order number for my ecommerce application, this is my code:
<?php
$bytes = random_bytes(3);
$random_hash = bin2hex($bytes);
$order_num = $random_hash . "1";
echo strtoupper(hash('crc32b', $order_num));
The order number (1 in this example) is going to be an auto-increment value retrieved from MySQL.
Does this guarantee uniqueness?
I wanted a short max 8-10 chars unique final value.
An only numbers solution would be fine too.
As far as I know, most hash algorithms make no guarantee of when collisions might occur, so you're probably just as likely to get a collision with your proposed code as using the random part on its own.
If the auto-increment part is unique, and the random part is just to avoid guesses, you could just concatenate the two parts together (i.e. everything in your example before the hash call). That way if the same random number comes up twice, it will have different numbers on the end.
If that results in something too long, you could use base_convert to turn the number into a shorter representation (for example base 36).
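A rough sketch of that idea, assuming $id is the unique auto-increment value from MySQL. makeOrderNumber() is a hypothetical helper, and the 2-byte prefix length is an arbitrary choice for illustration:

```php
<?php
// Sketch: fixed-length random prefix + the auto-increment id in base 36.
// makeOrderNumber() is a hypothetical helper; $id is assumed to be unique.
function makeOrderNumber(int $id): string
{
    $prefix = bin2hex(random_bytes(2));            // 4 hex chars, only to hinder guessing
    $suffix = base_convert((string) $id, 10, 36);  // shorter form of the unique id
    return strtoupper($prefix . $suffix);
}

echo makeOrderNumber(1), "\n";
echo makeOrderNumber(123456), "\n";
```

Because the prefix always has the same length, the part after it identifies the id, so two orders can never share a number even if the random part repeats. Ids below 36^6 (about 2.1 billion) stay within 10 characters in total.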
The hash function will not provide any uniqueness to the id, it only obfuscates the id a bit.
If you have, let's say, 100 possible values, you get at most 100 possible hashes from them, no more. An attacker who wants to brute-force the hashes can simply compute those 100 hashes and try them.
In your case, with 3 bytes of randomness there are only 2^24 = 16,777,216 possible values, and you will not see all of them before you get a duplicate. Because of the birthday paradox, the same random number is likely to come up again after only a few thousand values.
There are two common approaches when it comes to unique ids:
You let the database automatically increment the id, which makes sure that the id is unique.
You generate a UUID (a global id with 16 bytes), which offers such a huge keyspace that a duplicate is extremely unlikely. In practice one can neglect the possibility of duplicates.
The UUID has a lot of advantages and one disadvantage:
(+) UUIDs can work decentralized, e.g. in an offline scenario.
(+) One can generate the id before it is inserted in the database, so one does not have to wait until the row is created in the db.
(+) The ids are not deterministic, so an attacker cannot guess the next id.
(-) They use more storage space and are a bit slower when searching.
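For illustration, a random (version 4) UUID can be built from random_bytes() roughly like this. uuidV4() is a hypothetical helper name, not a built-in PHP function; many projects use a library for this instead.

```php
<?php
// Minimal sketch of a random (version 4) UUID built from random_bytes().
// uuidV4() is a hypothetical helper name, not part of PHP itself.
function uuidV4(): string
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);  // set the version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);  // set the variant bits to 10xx
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

echo uuidV4(), "\n";
```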

how unique is a portion of md5?

I have a question regarding the uniqueness of the md5 function.
I know that md5 hashes (of a microtime value) are not guaranteed to be unique; however, they are pretty unique :)
How can I calculate the probability of a collision between portions of two md5 hashes?
For example, the following PHP generates an 8-character string from the md5 result:
substr(md5(microtime()), 0, 8);
A second scenario: what if the starting index is random (so it takes a different portion of the hash each time)?
substr(md5(microtime()), rand(0, 24), 8);
There are 2^32 combinations of 8 hexadecimal digits. Even if they are completely random, you can only generate on the order of sqrt(2^32) = 65,536 such strings before you get 2 that are the same (the chance of a duplicate reaches 50% at about 77,000 strings).
Using md5(), with a random index or not, doesn't significantly change anything as long as all the microtime() values you use are unique. But if you are generating these too fast, or across many machines, then the situation is much, much worse, because there's a good chance you could end up using the same microtime() value twice.
Since you are asking about the uniqueness of your string, it is really a question of probability: the more possible characters you use and the longer the random string, the lower the chance of getting the same string twice.
So, to get a unique string you need to store the strings in your DB and compare each new random string against them. If you find a match, generate a fresh string, and repeat until you get one that is unique.
It depends on how many "sub-hashes" you are going to generate and how many bits you keep from the original MD5 hash (the length of a "sub-hash"). If you generate just 1 sub-hash and keep just 1 bit, there is no collision at all. If you generate 2 sub-hashes, expect a 50% chance of a collision. Use 2 bits and the odds are 25%. You do the math. Refer to the birthday paradox for more info.
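To "do the math" for a given length, the birthday approximation p = 1 - exp(-k^2/(2N)) can be solved for k. A small sketch, assuming the kept bits behave like uniformly random values (the helper name is invented for this example):

```php
<?php
// Inverts p = 1 - exp(-k^2 / (2N)) to estimate how many random values of
// $bits bits can be generated before a collision has probability $p.
// Hypothetical helper; assumes uniformly random values.
function valuesForCollisionProbability(int $bits, float $p): float
{
    $n = pow(2.0, $bits);
    return sqrt(2.0 * $n * log(1.0 / (1.0 - $p)));
}

printf("8 hex chars (32 bits), 50%%: %.0f values\n", valuesForCollisionProbability(32, 0.5));
printf("8 hex chars (32 bits), 1%%:  %.0f values\n", valuesForCollisionProbability(32, 0.01));
```

Each hex character is 4 bits, so an 8-character substring corresponds to $bits = 32.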

Hashing Birthday Paradox

So I am working on a piece of code that computes the hashes of 2^4 sets of 3 random prime numbers (less than 2^8). It then keeps selecting sets of 3 composite numbers (less than 2^8) until there is a set {c1, c2, c3} with a hash value that matches one of the previous hashes (the prime ones); the matching prime set would be known as {p1, p2, p3}.
From my understanding, the birthday attack is basically finding two functions that provide the same result. So I would create 2 functions? One for the prime numbers and then another for the composites? What would be the best way of doing this? I am thinking of PHP as the language.
Any help would be greatly appreciated.
I think the premise is looking for a set of any 3 numbers < 2^8 that produces the same hash value as a set of 3 prime numbers using the same hash function.
Not stated is the range of the hash value.
The birthday attack is based on the fact that since the range of the hash value is limited, a brute force method that tries hashing all combinations of 3 numbers < 2^8 is likely to produce some collisions with valid hash values well before actually trying all possible combinations. However, in this case, trying all combinations of 3 numbers < 2^8 only takes 16777216 loops, so a complete brute force approach can be used.
The program could create a histogram of all the possible hash values. Since there are only 54 primes < 2^8, generating the histogram for all valid inputs (3 primes) would take 54^3 = 157464 loops.
Checking for collisions using all sets of 3 numbers < 2^8 would take 2^24 = 16777216 loops, which shouldn't take too long depending on the hash algorithm.
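A sketch of the search as described in the question, in PHP. Since the range of the hash value is not stated, toyHash() below simply ASSUMES a 16-bit hash (the first 4 hex characters of md5); all function names are made up for this example.

```php
<?php
// Sketch only: the real hash function and its range are not given in the
// question, so toyHash() assumes a 16-bit hash for illustration.

function primesBelow(int $limit): array
{
    // Sieve of Eratosthenes over 0 .. $limit - 1.
    $isPrime = array_fill(0, $limit, true);
    $isPrime[0] = $isPrime[1] = false;
    for ($i = 2; $i * $i < $limit; $i++) {
        if ($isPrime[$i]) {
            for ($j = $i * $i; $j < $limit; $j += $i) {
                $isPrime[$j] = false;
            }
        }
    }
    return array_keys(array_filter($isPrime));
}

function toyHash(array $triple): string
{
    return substr(md5(implode(',', $triple)), 0, 4);
}

function randomTriple(array $pool): array
{
    $max = count($pool) - 1;
    return [$pool[random_int(0, $max)], $pool[random_int(0, $max)], $pool[random_int(0, $max)]];
}

function findCompositeCollision(int $sets = 16, int $limit = 256, int $maxTries = 5000000): ?array
{
    $primes = primesBelow($limit);
    $composites = array_values(array_diff(range(4, $limit - 1), $primes));

    // Hash the random prime triples and index them by hash value.
    $targets = [];
    for ($i = 0; $i < $sets; $i++) {
        $triple = randomTriple($primes);
        $targets[toyHash($triple)] = $triple;
    }

    // Keep drawing composite triples until one hashes to a stored value.
    for ($try = 1; $try <= $maxTries; $try++) {
        $candidate = randomTriple($composites);
        $hash = toyHash($candidate);
        if (isset($targets[$hash])) {
            return ['primes' => $targets[$hash], 'composites' => $candidate, 'hash' => $hash, 'tries' => $try];
        }
    }
    return null; // no match found within $maxTries
}

print_r(findCompositeCollision());
```

So one hash function is enough: it is applied to both kinds of input, and the stored prime hashes act as the lookup table.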

Hash collision worries

I have a system where a hash is generated out of a total of about 1 million possibilities. If there's a 10% chance of a collision, should I worry about the generating algorithm having to run 5 times?
I have a system similar to jsfiddle, where a user can "save" a file on my server. Now I'm using '23456789abcdefghijkmnopqrstuvwxyz', which is 33 chars, and the file name is 4 chars long, for a total of 33^4 = 1,185,921 possibilities.
The "filename" is generated randomly and if there's a collision it reruns to get another filename. Using a birthday paradox calculator I can see that after I have 500 entries I have a 10% chance of a collision.
What are the chances that I'll get a collision more than 5 times in a row? what about 4?
Is there any way to figure this out? Should I worry about it? What happens after 5000 entries?
Is there a program out there that can figure this out with any arbitrary inputs?
I don't think that the birthday paradox calculations apply. There's a difference between the odds of 500 random numbers out of 1185921 being all different and the odds of one new number being different once you have 500 known unique numbers.
If you have 500 assigned numbers and generate a new number at random, it will have odds of 500/1185921 of being a collision. With 500 names taken, the chances of 4 collisions in a row are (500/1185921)^4 < 10^-13. With 5000 existing file names, the odds of a new name being a collision are 5000/1185921, and the chance of 4 collisions in a row is < 10^-9.
My math is a little rusty so bear with me.
The chance of getting x collisions in a row is simply:
chance of collision ^ x;
Where the chance of collision is:
entries/space (which is 500/1185921 or 0.04%).
You can see above that this will get worse with the more entries (and better with a bigger space).
Also note the birthday paradox is perhaps not quite what you want. The 10% chance is the chance that any two entries will have had a collision, not the chance of a collision for the next entry.
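As a small sketch of that calculation for arbitrary inputs (the function name is invented for this example, and it assumes every name is drawn uniformly at random):

```php
<?php
// Chance that the next $times randomly generated names ALL collide with one of
// $entries existing names, out of $space possible names (hypothetical helper).
function consecutiveCollisionChance(int $entries, int $space, int $times): float
{
    return pow($entries / $space, $times);
}

$space = 33 ** 4; // 1,185,921 possible 4-character names
foreach ([500, 5000, 50000] as $entries) {
    printf(
        "%6d entries: 1 collision %.4f%%, 4 in a row %.2e, 5 in a row %.2e\n",
        $entries,
        100 * consecutiveCollisionChance($entries, $space, 1),
        consecutiveCollisionChance($entries, $space, 4),
        consecutiveCollisionChance($entries, $space, 5)
    );
}
```

The number of entries stays fixed across the retries because a rejected name is never stored.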

Unique ID with PHP

I wrote this simple line to get a random and unique code each time (just 8 characters):
echo substr(md5(uniqid(rand(), true)),0,8);
Output:
077331e5
5af425b1
0fc7dcf2
...
My question is whether I will never get a collision (duplicate), or whether that can happen.
PS:
Is it better to use time()?
echo substr(md5(uniqid(time(), true)),0,8);
Hashes can have collisions. By taking a substring of the hash you are just upping the chance of that happening.
Regardless of what you feed into md5(), by taking the substring you're discarding a large part of md5's output and constricting the range of possible hashes. md5 outputs a 128-bit value, and you're limiting it to 32 bits, so the chance that two given values collide goes from 1 in about 3.4x10^38 to 1 in about 4 billion.
Your "unique code" is a string of eight hexadecimal digits, and thus it has 4294967296 possible values. You are thus guaranteed to get a duplicate of an earlier code by the 4294967297th time you run it.
PHP has a function to provide unique ids called uniqid().
You stand a fair chance of your 8-char MD5 being unique, but as with any random string, the shorter you make it the more likely you are to have a collision.
Short answer: it can happen. There's a discussion here about the collision space of MD5 that you might want to check out. Taking a substring of the MD5 will make collisions much, much more likely.
A better solution may be the answer proposed here, possibly checking it against other unique IDs that you've generated.
Your code returns part of a hash. Hashes are for hashing, thus you can not guarantee any pattern within the results (eg. uniqueness).
Also, you are getting only part of a hash, and each character of a hash is a hexadecimal digit (0 to 9 or a to f, so 16 possibilities). It needs only a simple calculation:
16 ^ 8 = 4 294 967 296
to find how many unique values your code can generate. This number (4 294 967 296) means that if you use this function more than 4 294 967 296 times, the generated values are certain not to all be unique. Of course, in practice duplicates are likely to appear after far fewer calls than that.
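If 8 characters are a hard requirement, one option along the lines suggested above is to draw the code from random_bytes() and check it against the codes already issued. This is only a sketch: uniqueCode() is a hypothetical helper, and the $taken array stands in for a lookup against stored codes (in practice, for example, a UNIQUE index in the database).

```php
<?php
// Sketch: draw 8 hex characters from random_bytes() and retry on a duplicate.
// $taken stands in for the set of codes already stored somewhere persistent.
function uniqueCode(array &$taken, int $maxTries = 10): string
{
    for ($i = 0; $i < $maxTries; $i++) {
        $code = bin2hex(random_bytes(4));   // 32 bits = 8 hex characters
        if (!isset($taken[$code])) {
            $taken[$code] = true;
            return $code;
        }
    }
    throw new RuntimeException('No free code found; the keyspace may be too full.');
}

$taken = [];
for ($i = 0; $i < 3; $i++) {
    echo uniqueCode($taken), "\n";
}
```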
