Unique ID with PHP

I wrote this simple line to get a random, unique code each time (just 8 characters):
echo substr(md5(uniqid(rand(), true)),0,8);
Output:
077331e5
5af425b1
0fc7dcf2
...
My question: am I guaranteed never to get a collision (a duplicate), or can that happen?
P.S.: Would it be better to use time()?
echo substr(md5(uniqid(time(), true)),0,8);

Hashes can have collisions. By taking a substring of the hash you are just upping the chance of that happening.

Regardless of what you feed into md5(), by taking the substring you're throwing away most of md5's output and shrinking the range of possible hashes. md5 outputs a 128-bit string, and you're limiting it to 32 bits, so you've gone from a 1 in roughly 3.4×10^38 chance of a collision to 1 in about 4.3 billion.

Your "unique code" is a string of eight hexadecimal digits, and thus it has 4294967296 possible values. You are thus guanteed to get a duplicate of an earlier code by the 4294967297th time you run it.

PHP has a built-in function for producing unique IDs: uniqid().
You stand a fair chance of your 8-character MD5 substring being unique, but as with any random string, the shorter you make it, the more likely a collision becomes.
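For illustration, a minimal sketch of both options (standard PHP functions; the 8-character length is just this question's requirement):
// uniqid() with more entropy: a 23-character ID based on the current
// microsecond plus extra random digits (unique-ish, but guessable).
$id = uniqid('', true);
// If exactly 8 hex characters are required, random bytes avoid the
// hash-then-truncate detour entirely: 4 random bytes = 8 hex digits.
$code = bin2hex(openssl_random_pseudo_bytes(4));
echo $id, "\n", $code, "\n";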

Short answer: it can happen. There's a discussion here about the collision space of MD5 that you might want to check out. Taking a substring of the MD5 shrinks the space of possible values and makes a collision much, much more likely.
A better solution may be the answer proposed here, possibly combined with checking against the other unique IDs you've already generated.

Your code returns part of a hash. Hashes are meant for hashing, so you cannot guarantee any pattern in their results (e.g. uniqueness).
Also, you are keeping only part of the hash, and each character of a hash is a hexadecimal digit (0 to 9 or a to f: 16 possibilities). A simple calculation:
16 ^ 8 = 4 294 967 296
gives how many unique values your code can generate. It means that if you run the function more than 4 294 967 296 times, the generated values cannot all be unique; and of course a duplicate is very likely to appear after far fewer runs than that (see the birthday paradox).

Related

how unique is a portion of md5?

I have a question regarding the uniqueness of the md5 function.
I know that md5 values (of microtime values) are not guaranteed unique; however, they are pretty unique :)
How can I calculate the probability of a collision between two portions of md5 hashes?
For example, the following PHP generates an 8-character string from an md5 result:
substr(md5(microtime()), 0, 8);
A second scenario: what if the offset is randomized (so a different portion of the hash is taken each time)?
substr(md5(microtime()), rand(0, 24), 8); // md5() returns 32 hex chars, so 24 is the largest offset that still leaves 8
There are 2^32 combinations of 8 hexadecimal digits. Even if they were completely random, by the birthday paradox you can only generate about 65,000 such strings, on average, before two of them are the same.
Using a random offset into the md5() output doesn't significantly change anything, as long as all the microtime() values you use are unique. But if you are generating these too fast, or across many machines, then the situation is much, much worse, because there's a good chance you'll end up using the same microtime() value twice.
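To put numbers on that, here is a small sketch of the usual birthday approximation p ≈ 1 − e^(−n²/2N) in plain PHP (nothing assumed beyond the 2^32 space):
// Approximate chance of at least one collision after $n draws from a
// space of $N equally likely values (birthday approximation).
function collisionProbability($n, $N) {
    return 1.0 - exp(-($n * $n) / (2.0 * $N));
}
$N = pow(2, 32); // 8 hex characters
printf("%.4f\n", collisionProbability(10000, $N));  // ~0.0116
printf("%.4f\n", collisionProbability(65536, $N));  // ~0.3935
printf("%.4f\n", collisionProbability(100000, $N)); // ~0.6878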
Uniqueness here is really a matter of probability: the more characters in your alphabet and the longer the random string, the smaller the chance of generating the same string twice.
So, to get a guaranteed-unique string, you need to store every generated string in your DB and compare each new random string against them; if you find a match, generate a fresh string, and repeat until you get a unique one.
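A minimal sketch of that generate-and-check loop, assuming PDO against MySQL and a hypothetical codes table with a UNIQUE constraint on its code column:
// Keep generating 8-char codes until the insert succeeds; the UNIQUE
// index rejects duplicates atomically, so there is no race condition.
function uniqueCode(PDO $db) {
    $stmt = $db->prepare('INSERT IGNORE INTO codes (code) VALUES (?)');
    do {
        $code = substr(md5(uniqid(rand(), true)), 0, 8);
        $stmt->execute(array($code));
    } while ($stmt->rowCount() === 0); // 0 rows inserted => duplicate, retry
    return $code;
}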
It depends on how many "sub-hashes" you are going to generate and how many bits you keep from the original MD5 hash (the length of a "sub-hash"). If you generate just one sub-hash and keep just one bit, there's no collision at all. Generate two such sub-hashes and expect a 50% collision chance; keep 2 bits and the odds drop to 25%. You do the math. Refer to the birthday paradox for more info.

Hashing Birthday Paradox

So I am working on a piece of code that computes the hashes of 2^4 sets of 3 random prime numbers (each less than 2^8). It then keeps selecting sets of 3 composite numbers (less than 2^8) until it finds a set {c1, c2, c3} whose hash value matches the hash of one of the prime sets {p1, p2, p3}.
From my understanding, the birthday attack is basically about finding two inputs that produce the same hash result. So would I create two functions, one for the prime numbers and another for the composites? What would be the best way of doing this? I am thinking PHP as the language.
Any help would be greatly appreciated.
I think the premise is looking for a set of any 3 numbers < 2^8 that produces the same hash value as a set of 3 prime numbers using the same hash function.
Not stated is the range of the hash value.
The birthday attack is based on the fact that since the range of the hash value is limited, a brute force method that tries hashing all combinations of 3 numbers < 2^8 is likely to produce some collisions with valid hash values well before actually trying all possible combinations. However, in this case, trying all combinations of 3 numbers < 2^8 only takes 16777216 loops, so a complete brute force approach can be used.
The program could create a histogram of all the possible hash values. Since there are only 54 primes < 2^8, generating the histogram for all valid inputs (3 primes) would take 54^3 = 157464 loops.
Checking for collisions using all sets of 3 numbers < 2^8 would take 2^24 = 16777216 loops, which shouldn't take too long depending on the hash algorithm.
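A brute-force sketch of that in PHP; since the hash range isn't stated, md5 truncated to 16 bits stands in for the small-range hash, and the "a,b,c" input encoding is an assumption as well:
// Primes below 2^8 (there are 54).
$primes = array();
for ($n = 2; $n < 256; $n++) {
    for ($d = 2; $d * $d <= $n; $d++) {
        if ($n % $d == 0) continue 2;
    }
    $primes[] = $n;
}
$isPrime = array_fill(0, 256, false);
foreach ($primes as $p) $isPrime[$p] = true;
function h($a, $b, $c) { return substr(md5("$a,$b,$c"), 0, 4); } // 16-bit hash
// Phase 1: record the hash of every prime triple (54^3 = 157,464 loops).
$primeHashes = array();
foreach ($primes as $p1)
    foreach ($primes as $p2)
        foreach ($primes as $p3)
            $primeHashes[h($p1, $p2, $p3)] = array($p1, $p2, $p3);
// Phase 2: try composite triples until one collides (at most ~2^24 loops).
for ($c1 = 4; $c1 < 256; $c1++) {
    if ($isPrime[$c1]) continue;
    for ($c2 = 4; $c2 < 256; $c2++) {
        if ($isPrime[$c2]) continue;
        for ($c3 = 4; $c3 < 256; $c3++) {
            if ($isPrime[$c3]) continue;
            $hv = h($c1, $c2, $c3);
            if (isset($primeHashes[$hv])) {
                list($p1, $p2, $p3) = $primeHashes[$hv];
                echo "($c1,$c2,$c3) collides with ($p1,$p2,$p3) on $hv\n";
                exit;
            }
        }
    }
}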

Encoding/Compressing a large integer into alphanumeric value

I have a very large integer, 12-14 digits long, and I want to encrypt/compress it into an alphanumeric value so that the integer can be recovered later from that value. I tried converting the integer to base 62 and mapping the digits to a-zA-Z0-9, but the value this produces is 7 characters long. That is still too long; I want to get down to about 4-5 characters.
Is there a general way to do this, or some method by which it can be done so that recovering the integer is still possible? I am asking about the mathematical aspects here, but I will be programming this in PHP, which I have only recently started using.
Edit:
I was thinking in terms of assigning a masking bit and using it to generate fewer characters. I am aware that the range is not sufficient, which is why I was focusing on a mathematical trick or representation. Base 62 is an idea I already applied, but it is not working out.
14-digit decimal numbers can express 100,000,000,000,000 values (10^14).
5 characters of a 62-character alphabet can express 916,132,832 values (62^5).
You cannot cram the equivalent number of values of a 14-digit number into a 5-character base-62 string. It's simply not possible to express each possible value uniquely. See http://en.wikipedia.org/wiki/Pigeonhole_principle. Even base 64 with 7 characters is not enough (only 4,398,046,511,104 possible values). In fact, if you target a 5-character short string you'd need to compensate by using a base-631 alphabet (631^5 = 100,033,806,792,151).
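As a quick sanity check on that 631 figure, the smallest base b with b^5 ≥ 10^14 can be computed in a couple of lines of PHP:
// b = ceil((10^14)^(1/5)) = ceil(10^2.8)
$base = (int) ceil(pow(pow(10, 14), 1 / 5));
echo $base, "\n";                  // 631
echo number_format(pow($base, 5)); // 100,033,806,792,151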
Even compression doesn't help you. It would mean that two or more numbers would need to compress to the same compressed string (because there aren't enough possible unique compressed values), which logically means it's impossible to uncompress them into two different values.
To illustrate this very simply: Say my alphabet and target "string length" consists of one bit. That one bit can be 0 or 1. It can express 2 unique possible values. Say I have a compression algorithm which compresses anything and everything into this one bit. ... How could I possibly uncompress 100,000,000,000,000 unique values out of that one bit with two possible values? If you'd solve that problem, bandwidth and storage concerns would immediately evaporate and you'd be a billionaire.
With 95 printable ASCII characters you can switch to base 95 encoding instead of 62:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
That way an integer string of length X can be compressed into a base-95 string of length Y, where
Y = X * log 10 / log 95 ≈ X / 2,
which is pretty good compression. So from length 12 you get down to 6. If the purpose of the compression is to save bandwidth by using JSON, then base 92 can be a good choice (excluding ", \ and /, which get escaped in JSON).
Surely you can get better compression but the price to pay is a larger alphabet. Just replace 95 in the above formula by the number of symbols.
Unless of course, you know the structure of your integers. For instance, if they have plenty of zeroes, you can base your compression on this knowledge to get much better results.
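A sketch of the base-95 conversion itself; 14-digit values fit in a 64-bit integer, so plain ints suffice (this assumes a 64-bit PHP build and PHP 7's intdiv()):
// Alphabet: the 95 printable ASCII characters chr(32) ' ' .. chr(126) '~'.
// Drop '"', '\' and '/' and use base 92 if JSON-safety matters.
function base95_encode($n) {
    $out = '';
    do {
        $out = chr(32 + $n % 95) . $out;
        $n = intdiv($n, 95);
    } while ($n > 0);
    return $out;
}
function base95_decode($s) {
    $n = 0;
    foreach (str_split($s) as $ch) {
        $n = $n * 95 + (ord($ch) - 32);
    }
    return $n;
}
$v = 12345678901234;                  // a 14-digit value
$enc = base95_encode($v);             // 7 characters
var_dump(base95_decode($enc) === $v); // bool(true)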
Because of the pigeonhole principle, you will end up with some values that get compressed and other values that get expanded. It is simply impossible to create a compression algorithm that compresses every possible input string (in your case, your numbers).
If you force the cardinality of the output set to be smaller than the cardinality of the input set, you'll get collisions (i.e. several input strings get "compressed" to the same output string). A compression algorithm should be reversible, right? :)

RijndaelManaged.CreateEncryptor key expansion

There are two ways to specify a key and an IV for a RijndaelManaged object. One is by calling CreateEncryptor:
var encryptor = rij.CreateEncryptor(Encoding.UTF8.GetBytes(key), Encoding.UTF8.GetBytes(iv));
and another one by directly setting Key and IV properties:
rij.Key = "1111222233334444";
rij.IV = "1111222233334444";
As long as the length of the Key and IV is 16 bytes, both methods produce the same result. But if your key is shorter than 16 bytes, the first method still allows you to encode the data and the second method fails with an exception.
Now this may sound like an absolutely abstract question, but I have to use PHP and a key which is only 10 bytes long in order to send an encrypted message to a server which uses the first method.
So the question is: How does CreateEncryptor expand the key and is there a PHP implementation? I cannot alter the C# code so I'm forced to replicate this behaviour in PHP.
I'm going to have to start with some assumptions. (TL;DR - The solution is about two-thirds of the way down but the journey is way cooler).
First, in your example you set IV and Key to strings. This can't be done. I'm therefore going to assume we call GetBytes() on the strings, which is a terrible idea by the way, as there are fewer potential byte values in usable ASCII space than there are in all 256 values of a byte; that's what GenerateIV() and GenerateKey() are for. I'll get to this at the very end.
Next I'm going to assume you're using the default block, key and feedback size for RijndaelManaged: 128, 256 and 128 respectively.
Now we'll decompile the Rijndael CreateEncryptor() call. When it creates the Transform object it doesn't do much of anything with the key at all (except set m_Nk, which I'll come to later). Instead it goes straight to generating a key expansion from the bytes it is given.
Now it gets interesting:
switch (this.m_blockSizeBits > rgbKey.Length * 8 ? this.m_blockSizeBits : rgbKey.Length * 8)
So:
if 128 > len(k) x 8, switch on 128
if 128 <= len(k) x 8, switch on len(k) x 8
128 / 8 = 16, so if len(k) is 16 we can expect to switch on len(k) x 8. If it's more, then it will switch on len(k) x 8 too. If it's less it will switch on the block size, 128.
Valid switch values are 128, 192 and 256. That means it will only fall to default (and throw an exception) if it's over 16 bytes in length and not a valid block (not key) length of some sort.
In other words, it never checks against the key length specified on the RijndaelManaged object. It goes straight into the key expansion and starts operating at the block level, as long as the key length (in bits) is one of 128, 192, 256, or anything less than 128. This is actually a check against the block size, not the key size.
So what happens now that we've patently not checked the key length? The answer has to do with the nature of the key schedule. When you enter a key into Rijndael, the key needs to be expanded before it can be used; in this case, it's going to be expanded to 176 bytes. To accomplish this, it uses an algorithm specifically designed to turn a short byte array into a much longer one.
Part of that involves checking the key length. A bit more decompilation fun and we find that this is defined as m_Nk. Sound familiar?
this.m_Nk = rgbKey.Length / 4;
Nk is 4 for a 16-byte key, less when we enter shorter keys. That's 4 words, for anyone wondering where the magic number 4 came from. This causes a curious fork in the key scheduler, there's a specific path for Nk <= 6.
Without going too deep in to the details, this actually happens to 'work' (ie. not crash in a fireball) with a key length less than 16 bytes... until it gets below 8 bytes.
Then the entire thing crashes spectacularly.
So what have we learned? When you use CreateEncryptor you are actually throwing a completely invalid key straight into the key scheduler, and it's serendipity that it sometimes doesn't outright crash on you (or a horrible contractual integrity breach, depending on your POV); probably an unintended side effect of the fact that there's a specific fork for short key lengths.
For completeness sake we can now look at the other implementation where you set the Key and IV in the RijndaelManaged object. These are stored in the SymmetricAlgorithm base class, which has the following setter:
if (!this.ValidKeySize(value.Length * 8))
throw new CryptographicException(Environment.GetResourceString("Cryptography_InvalidKeySize"));
Bingo. Contract properly enforced.
The obvious answer is that you cannot replicate this in another library unless that library happens to contain the same glaring issue, which I'm going to call a bug in Microsoft's code because I really can't see any other option.
But that answer would be a cop out. By inspecting the key scheduler we can work out what's actually happening.
When the expanded key is initialised, it populates itself with 0x00s. It then writes to the first Nk words with our key (in our case Nk = 2, so it populates the first 2 words or 8 bytes). Then it enters a second stage of expanding upon that by populating the rest of the expanded key beyond that point.
So now we know it essentially pads everything past 8 bytes with 0x00. Can't we just pad the key with 0x00s ourselves, then? No, because this shifts Nk up to Nk = 4. As a result, although our first 4 words (16 bytes) will be populated as we expect, the second stage will begin expanding at the 17th byte, not the 9th!
The solution then is utterly trivial. Rather than padding our initial key with 6 additional bytes, just chop off the last 2 bytes.
So your direct answer in PHP is:
$key = substr($key, 0, -2);
Simple, right? :)
Now you can interop with this encryption function. But don't. It can be cracked.
Assuming your key uses lowercase, uppercase and digits, you have an exhaustive search space of only 218 trillion keys.
62 values (26 + 26 + 10) is the search space of each byte, because you never use the other 194 (256 - 62) values. Since we have 8 bytes, there are 62^8 possible combinations: 218 trillion.
How fast can we try all the keys in that space? Let's ask openssl what my laptop (running lots of clutter) can do:
Doing aes-256 cbc for 3s on 16 size blocks: 12484844 aes-256 cbc's in 3.00s
That's 4,161,615 passes/sec. 218,340,105,584,896 / 4,161,615 / 3600 / 24 = 607 days.
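The same arithmetic as a throwaway PHP snippet (the rate is just the laptop benchmark above):
$keys = pow(62, 8);   // 218,340,105,584,896 possible 8-byte keys
$rate = 12484844 / 3; // ~4,161,615 tries per second
printf("%.0f days\n", $keys / $rate / 86400); // ~607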
Okay, 607 days isn't bad. But I can always just fire up a bunch of Amazon servers and cut that down to ~1 day by asking 607 equivalent instances to calculate 1/607th of the search space. How much would that cost? Less than $1000, assuming that each instance was somehow only as efficient as my busy laptop. Cheaper and faster otherwise.
There is also an implementation that is twice the speed of openssl, so cut whatever figure we've ended up with in half.
Then we've got to consider that we'll almost certainly find the key before exhausting the entire search space. So for all we know it might be finished in an hour.
At this point we can assert that if the data is worth encrypting, it's probably worth someone's while to crack the key.
So there you go.

How many bytes are unique enough for twitter?

I don't want my database IDs to be sequential, so I'm trying to generate UIDs with this code:
$bin = openssl_random_pseudo_bytes(12);
$hex = bin2hex($bin);
return base_convert($hex, 16, 36); // beware: base_convert works on floats internally and loses precision on numbers this large
My question is: how many bytes would I need to make the IDs unique enough to handle large numbers of records (like Twitter)?
Use PHP's uniqid(), with an added entropy factor. That'll give you plenty of room.
You might consider something like the way tinyurl and other URL shorteners work. I've used similar techniques, which guarantee uniqueness until all combinations are exhausted. So basically you choose an alphabet and how many characters you want as a length. Let's say we use alphanumerics, upper and lower case, so that's 62 characters in the alphabet, and 5 characters per code. That's 62^5 = 916,132,832 combinations.
You start with your sequential database ID and multiply it by some fairly large prime (like 2097593), making sure to wrap around if you exceed 62^5, and then convert that number to base 62 per your chosen alphabet; a sketch follows below.
This makes each code look fairly random, yet because we multiplied by a prime, we're guaranteed not to hit the same number twice until we've used every code. And it's very short.
You can use longer keys with a smaller alphabet, too, if length isn't a concern.
Here's a question I asked along the same lines: Tinyurl-style unique code: potential algorithm to prevent collisions
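A sketch of that multiply-by-a-prime scheme with the 62^5 space and the prime suggested above (decoding would use the modular inverse of the prime, which exists because 2097593 is coprime to 62^5):
$alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$space = 916132832; // 62^5 possible codes
$prime = 2097593;   // coprime to 62^5, so id -> (id * prime) % space is a bijection
function idToCode($id, $alphabet, $space, $prime) {
    $n = ($id * $prime) % $space; // scramble the sequential ID
    $code = '';
    for ($i = 0; $i < 5; $i++) {  // fixed-width 5-character code
        $code = $alphabet[$n % 62] . $code;
        $n = intdiv($n, 62);
    }
    return $code;
}
echo idToCode(1, $alphabet, $space, $prime), "\n";
echo idToCode(2, $alphabet, $space, $prime), "\n"; // nothing like code #1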
Assuming that openssl_random_pseudo_bytes may generate every possible value, N bytes will give you 2 ^ (N * 8) distinct values. For 12 bytes this is 7.923 * 10^28
Use MySQL's UUID():
insert into `database`(`unique`,`data`) values(UUID(),'Test');
If you're not using MySQL, search Google for "UUID <your database>" and you'll find an equivalent.
Source: Wikipedia:
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%
