I am writing an affiliate system, and I want to generate a unique 32-character token from a URL.
The problem is that a URL can be up to 128 chars long (IIRC). Is there a way I can create a unique 32-character key/token from a given URL, without any 'collisions'?
I am not sure if this is an encoding, encryption or hashing problem (probably, a mixture of all three).
I will be implementing this 'mapping function' using PHP, since that is the language I am using to build this particular system. Any suggestions on how to go about doing this?
Is it even possible to map a 128 char string into a 32 char string uniquely (i.e. no collisions?) ...
[Edit]
I just did some reading up and found that the max length of URLs is actually something on the order of 2K. However, I am not concerned about 'silly' edge cases like that. I am pretty sure that 99.9% of the time my imposed limit of 128 chars should be sufficient.
Is it even possible to map a 128 char string into a 32 char string uniquely (i.e. no collisions?) ...
In part. You can use a hash function like md5 or sha1. That's what they were built to do.
MD5 produces a 32-character hex string, and SHA-1 produces a 40-character hex string.
Of course you can't guarantee that there won't be collisions. That's impossible since the message space is larger than the hash space (there are 2^1024 possible messages vs 2^128 possible hashes if you are using MD5), but these functions are meant to be collision resistant and hard to reverse.
Wikipedia references:
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/MD5
http://en.wikipedia.org/wiki/SHA-1
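A minimal sketch of that approach in PHP (standard md5()/sha1(); the URL and variable names are just illustrative):

// Hash the URL down to a fixed-width token.
$url = 'https://example.com/landing-page?affiliate=123';
$token32 = md5($url);   // 32 hex characters (128 bits)
$token40 = sha1($url);  // 40 hex characters (160 bits)
// The same URL always yields the same token, so it can serve as a lookup key.
echo $token32, "\n", $token40;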
Is it even possible to map a 128 char string into a 32 char string uniquely (i.e. no collisions?) ...
That depends on the alphabet being used for both input and output. If your resulting 32 char hash is limited to an alphabet of a-z, you can encode a maximum of 26^32 = 1.901722×10^45 values in it. A URL can consist of at least a-z plus quite a number of other characters, so there are at least 26^128 = 1.307942×10^181 possible values. So an alphabet of 26 characters is not enough.
Using a-zA-Z0-9 you can encode 62^32 = 2.272658×10^57 unique values, which is still not enough. Even an alphabet of 100 characters gives you only 100^32 = 1.0×10^64 possible values.
Depending on what exactly you want to do, you should either increase the length of the hash or rethink the overall approach.
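To see the same mismatch in terms of bits, you can compare the information capacity of both sides (a back-of-the-envelope sketch; the figures mirror the ones above):

// Bits needed to represent 128 characters from a 26-character alphabet
// versus bits available in a 32-character token over various alphabets.
$urlBits       = 128 * log(26, 2);  // ~601 bits of input space
$lowercaseBits = 32  * log(26, 2);  // ~150 bits
$alnumBits     = 32  * log(62, 2);  // ~190 bits
printf("URL space: %.0f bits, a-z token: %.0f bits, a-zA-Z0-9 token: %.0f bits\n",
       $urlBits, $lowercaseBits, $alnumBits);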
Related
I'm generating hexadecimal digits to build unique random activation links, such as:
hostname/account/confirm/$randomHex
From searching, my random hex in PHP is currently generated with:
bin2hex( openssl_random_pseudo_bytes(16) )
The above generates a string of 32 hex digits, but I would prefer to use a shorter length, such as 12 hex digits.
Considering the processing power of computers, what is the minimum secure length of hexadecimal string that I can use?
Considering the processing power of computers, what is the minimum secure length of hexadecimal string that I can use?
This is actually an easy number to calculate, if you have a threat model in place.
Based on the URL you provided, it seems you're generating a URL for email ownership verification. This is decidedly less sensitive than, say, a password reset URL.
If you rate limit bad attempts (i.e. block their IP address from being able to attempt again for 24 hours), you can get by with 8 hex characters (32 bits); sheer chance means they'll be able to guess a valid confirmation link after 65,536 tries with 50% probability (the birthday paradox). Pulling this off would also require 65,536 IP addresses just to blindly confirm someone's email address (probably not their own).
HOWEVER!
As stated above, if you are using this for, e.g., a recovery feature (I forgot my password), don't skimp on string length. 128 bits (32 hex characters, 16 raw bytes) should be considered a lower bound. I'd say shoot for 256 bits just to be safe.
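For example, a sketch using PHP's random_bytes() (available since PHP 7; openssl_random_pseudo_bytes() works similarly on older versions):

// Email confirmation token: 128 bits is a sensible lower bound.
$confirmToken = bin2hex(random_bytes(16)); // 32 hex characters
// Password reset / recovery token: aim higher, e.g. 256 bits.
$resetToken = bin2hex(random_bytes(32));   // 64 hex characters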
The above generates a string of 32 hex digits, but I would prefer to use a shorter length, such as 12 hex digits.
If you want to increase the security of a string given a fixed length, the only way to do so is to increase the number of possible values for each character in the string.
Even if you were using raw binary, which you're not, the upper limit of 11 characters is 88 bits of entropy. Specifying hex cuts you down to 44 (but most likely 40, since you'd probably write bin2hex(random_bytes(5)) here).
If you want to securely generate a fixed-size string with an arbitrary alphabet, check out this StackOverflow answer.
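As a rough illustration of the idea (this is not the linked answer itself, just a sketch using random_int(), which requires PHP 7+):

// Generate a fixed-length random string from an arbitrary alphabet.
function random_str($length, $alphabet)
{
    $max = strlen($alphabet) - 1;
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        // random_int() is cryptographically secure and unbiased.
        $out .= $alphabet[random_int(0, $max)];
    }
    return $out;
}
echo random_str(12, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789');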
16 randomly generated bytes give 128 bits of entropy. A key with 128 bits of entropy is uncrackable by an offline brute force attack, even with every computer in the world working on cracking it.
However, you are looking to prevent an online brute force attack, which is much slower. If you wanted 12 hex characters, this would be 6 bytes and therefore 48 bits of entropy. This gives you 281,474,976,710,656 possibilities. If your site takes 0.25* seconds to respond, this would take 2^47 * 0.25 = 35,184,372,088,832 seconds to brute force on average by making requests to your site (1.116 million years).
You're safe with 48 bits.
*In reality this would be a parallel attack, so the attacker would not have to wait for a response if all they're trying to do is validate an account. However, there will be a rate limit to any system, slowing the attack. Adjust the figures to suit your system as necessary.
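If you want to re-run that arithmetic with your own response time or rate limit, the calculation is straightforward (the values here simply mirror the ones above):

$bits            = 48;            // 6 random bytes shown as 12 hex characters
$possibilities   = 2 ** $bits;    // 281,474,976,710,656
$secondsPerGuess = 0.25;          // assumed response time per request
// On average an attacker succeeds after trying half the keyspace.
$avgSeconds = ($possibilities / 2) * $secondsPerGuess;
$avgYears   = $avgSeconds / (365 * 24 * 3600);
printf("%.0f seconds, roughly %.2f million years\n", $avgSeconds, $avgYears / 1e6);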
I've just had an interesting little problem.
Using Laravel 4, I encrypt some entries before adding them to a db, including email address.
The db was setup with the default varchar length of 255.
I've just had an entry that encrypted to 309 characters, which broke decryption because the last 50-odd characters were cut off in the db.
I've (temporarily) fixed this by simply increasing the varchar length to 500, which should - in theory - cover me from this, but I want to be sure.
I'm not sure how the encryption works, but is there a way to tell what maximum character length to expect from the encrypt output, for the sake of sizing my database field?
Should I change my field type from varchar to something else to ensure this doesn't happen again?
Conclusion
First, be warned that there have been quite a few changes between 4.0.0 and 4.2.16 (which seems to be the latest version).
The scheme starts with a staggering overhead of 188 characters for 4.2 and about 244 for 4.0 (given that I did not forget any newlines and such). So to be safe you will probably need on the order of 200 characters for 4.2 and 256 characters for 4.0, plus 1.8 times the plaintext size, if the characters in the plaintext are encoded as single bytes.
Analysis
I just looked into the source code of Laravel 4.0 and Laravel 4.2 with regard to this function. Let's look at the size first:
the data is serialized, so the encryption size depends on the size of the type of the value (which is probably a string);
the serialized data is PKCS#7 padded using Rijndael 256 or AES, so that means adding 1 to 32 bytes or 1 to 16 bytes - depending on the use of 4.0 or 4.2;
this data is encrypted with the key and an IV;
both the ciphertext and IV are separately converted to base64;
an HMAC using SHA-256 over the base64-encoded ciphertext is calculated, yielding a lowercase hex string of 64 characters;
finally, the output consists of base64_encode(json_encode(compact('iv', 'value', 'mac'))) (where value is the base64 ciphertext and mac is the HMAC value, of course).
A string in PHP is serialized as s:<i>:"<s>"; where <i> is the size of the string, and <s> is the string (I'm presuming PHP platform encoding here with regards to the size). Note that I'm not 100% sure that Laravel doesn't use any wrapping around the string value, maybe somebody could clear that up for me.
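For reference, PHP's own serialization of a plain string looks like this (you can verify it yourself):

echo serialize('user@example.com');
// s:16:"user@example.com";  -> 8 bytes of overhead here, 9 once the length reaches three digits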
Calculation
All in all, everything depends quite a lot on character encoding, and it would be rather dangerous for me to make a good estimation. Let's assume a 1:1 relation between bytes and characters for now (e.g. US-ASCII):
serialization adds up to 9 characters for strings up to 999 characters
padding adds up to 16 or 32 bytes, which we assume are characters too
encryption keeps data the same size
base64 in PHP creates ceil(len / 3) * 4 characters - but let's simplify that to (len * 4) / 3 + 4; the base64-encoded IV is 44 characters
the full HMAC is 64 characters
the JSON encoding adds 3*5 characters for quotes and colons, plus 4 characters for the braces and commas around them, totaling 19 characters (I'm presuming json_encode does not end with white space here); the outer base64 again adds the same kind of overhead
OK, so I'm getting a bit tired here, but you can see that the plaintext is expanded at least twice by base64 encoding. In the end it's a scheme that adds quite a lot of overhead; they could just have used base64(IV|ciphertext|mac) to seriously cut down on it.
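A rough sketch of that estimate as code, using the simplifications above and the 4.2-style 16-byte AES block/IV (this is an approximation, not Laravel's actual code path):

// Rough estimate of the Crypt::encrypt() output length for a plaintext
// of $len single-byte characters (Laravel 4.2-style, approximation only).
function estimateEncryptedLength($len)
{
    $serialized = $len + 9;                              // s:<i>:"...";
    $padded     = $serialized + (16 - $serialized % 16); // PKCS#7 to a 16-byte block
    $valueB64   = (int) (ceil($padded / 3) * 4);         // base64 of the ciphertext
    $ivB64      = 24;                                    // 16-byte IV in base64
    $mac        = 64;                                    // HMAC-SHA-256 as hex
    $json       = $valueB64 + $ivB64 + $mac + 19;        // quotes, colons, braces
    return (int) (ceil($json / 3) * 4);                  // outer base64
}
echo estimateEncryptedLength(50); // on the order of a few hundred characters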
Notes
if you're not on 4.2 now, I would seriously consider upgrading to the latest version because 4.2 fixes quite a lot of security issues;
the sample code uses a string as key, and it is unclear if it is easy to use bytes instead;
the documentation does warn against key sizes other than the Rijndael defaults, but forgets to mention string encoding issues;
padding is always performed, even if CTR mode is used, which kind of defeats the purpose;
Laravel pads using PKCS#7 padding, but as the serialization always seems to end with ;, that was not really necessary;
it's a nice thing to see authenticated encryption being used for database encryption (the IV wasn't used, fixed in 4.2).
@MaartenBodewes' answer does a very good job of explaining how long the actual string will probably be. However, you can never know it for sure, so here are two options to deal with the situation.
1. Make your field text
Change the field from a limited varchar to a "self-expanding" text. This is probably the simpler option, and especially if you expect rather long input I'd definitely recommend it.
2. Just make your varchar longer
As you did already, make your varchar longer depending on what input length you expect/allow. I'd multiply by a factor of 5.
But don't stop there! Add a check in your code to make sure the data doesn't get truncated:
$encrypted = Crypt::encrypt($input);
if (strlen($encrypted) > 500) {
    // do something about it
}
What can you do about it?
You could either write an error to the log and include the encrypted data (so you can manually re-insert it after you have extended the length of your DB field):
Log::error('An encrypted value was too long for the DB field xy. Length: '.strlen($encrypted).' Data: '.$encrypted);
Obviously that means you have to check the logs frequently (or have them sent to you by mail), and also that the user could encounter errors while using the application because of the incorrect data in your DB.
The other way would be to throw an exception (and display an error to the user) and of course also write it to the log so you can fix it...
Anyways
Whether you choose option 1 or 2 you should always restrict the accepted length of your input fields. Server side and client side.
I'm learning about PHP's crypt() function and have been running some tests with it. According to this post, I should use a salt that's 22 characters long. I can, however, use a string that's 23 characters long with some limitations. When I use a 22 character long string I always get an outcome of '$2y$xxStringStringStringStri.HashHashHashHashHashHashHashHas'. I know the period is just part of the salt.
It seems that if I use 23 characters instead of just 22, I can successfully generate different hashes, but there are only 4 different outcomes across all 64 characters. The 23rd character "rounds down" to the nearest quarter of the 64-character alphabet (e.g. a 23rd character of "W" rounds down to "O", and any digit rounds down to "u"):
v---------------v---------------v---------------v---------------
./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890
All four of these crypt functions generate the same salt:
crypt('Test123','$2y$09$AAAAAAAAAAAAAAAAAAAAAq');
crypt('Test123','$2y$09$AAAAAAAAAAAAAAAAAAAAAr');
crypt('Test123','$2y$09$AAAAAAAAAAAAAAAAAAAAAs');
crypt('Test123','$2y$09$AAAAAAAAAAAAAAAAAAAAAt');
But this one is different:
crypt('Test123','$2y$09$AAAAAAAAAAAAAAAAAAAAAu');
So why shouldn't I use the 23rd character when it can successfully generate different outcomes? Is there some kind of glitchy behavior in PHP that should be avoided by not using it?
For clarification on how I'm counting the 23rd character in the salt:
crypt('Test123','$2y$08$ABCDEFGHIJKLMNOPQRSTUV');
// The salt is '$ABCDEFGHIJKLMNOPQRSTUV'
// Which will be treated as '$ABCDEFGHIJKLMNOPQRSTUO'
It has to do with hash collisions. Once you exceed 22 characters, your generated hashes are no longer unique, depending on the NAMESPACE of the algorithm. To put it another way, more than 22 characters doesn't result in any increased security and can actually decrease your level of security.
$ is not part of the actual salt. It is a separator.
For Blowfish crypt, the format is $2[axy]$log2Rounds$[salt][hash]. You describe it adding a . -- that's because you are missing the last character. Blowfish's salt is 128 bits. You could use only 126, yes, but you are just unnecessarily weakening the salt.
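In practice, rather than crafting the 22-character salt by hand, it is simpler to let PHP generate it (a sketch; random_bytes() requires PHP 7, and password_hash() is available since PHP 5.5):

// A full 22-character bcrypt salt from 16 random bytes (128 bits).
$salt = substr(strtr(base64_encode(random_bytes(16)), '+', '.'), 0, 22);
$hash = crypt('Test123', '$2y$09$' . $salt);
// Or let PHP handle the salt and formatting entirely:
$hash2 = password_hash('Test123', PASSWORD_BCRYPT, ['cost' => 9]);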
Possible Duplicate:
php short hash
I need to generate a short hash from URLs, the shortest possible, say under 6 characters.
I need them to be unique just for the same domain, so a hash from
www.example.com/category/sth/blablabla must be different from one from
www.example.com/category2/sth/blabla, but it does not need to differ from:
www.example2.com/category/sth/blablabla
Would using md5($url) and then picking some 5 characters out of that result (for example the first, last, middle and 2 other characters) give a unique id?
Would this abbreviated hash be unique as well?
A hash is not unique by definition. It's mathematically impossible to get a unique hash for something longer than the hash itself, unless the input does not use its full range (which is the case for URLs, but you cannot exploit that in general). Alternatively, you could use a simple incrementing ID, but that won't allow you to recognize matching URLs.
Either use a really long hash (at least 10 characters, ideally using upper and lower case letters), or accept collisions and handle them appropriately, which is how actual hash tables work.
For low probability of collisions you can use universal hashing techniques. For example, choose a prime number P. Then for each character position of the URL choose a random coefficient a[i] in the interval [0, P). Compute the hash of the URL as SUM(a[i]*c[i]) mod P, where c[i] is the i-th character of the original URL. Then take the string containing the digits of the resulting integer as the hash.
Read more in this paper: http://www.cs.cmu.edu/~avrim/451/lectures/lect0929.pdf.
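A small sketch of that construction (the prime and the number of coefficients are chosen here purely for illustration; the coefficients must be generated once and kept fixed, otherwise the hash changes between runs):

// Universal-style hash: sum(a[i] * c[i]) mod P over the URL's bytes.
function universal_hash($url, array $coeffs, $p)
{
    $sum = 0;
    foreach (str_split($url) as $i => $char) {
        $a   = $coeffs[$i % count($coeffs)]; // reuse coefficients for long URLs
        $sum = ($sum + $a * ord($char)) % $p;
    }
    return $sum;
}

$p      = 1000003; // a prime
$coeffs = [];
for ($i = 0; $i < 64; $i++) {
    $coeffs[] = random_int(0, $p - 1);
}
echo universal_hash('www.example.com/category/sth/blablabla', $coeffs, $p);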
Yes, a small change in a URL will change pretty much every character in a good hash. MD5 or SHA1 is probably fine for this. Hence, take the first X characters - and you won't get any improvement by choosing the last X characters, or the first/last/middle. They're all good!
Obviously the more characters you put in your partial hash, the less likely you are to get collisions.
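For example (a sketch; 6 hex characters of an md5 hash is only 24 bits, so collisions become likely once you have more than a few thousand URLs):

$url   = 'www.example.com/category/sth/blablabla';
$short = substr(md5($url), 0, 6); // first 6 hex characters = 24 bits
echo $short;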
I would try using crc32($url); it will give an integer that is usually 10-11 characters long (it could be a negative value), but it will still be shorter than the 32 chars of md5.
The only problem is that crc32 is not 100% unique, but it's very unlikely that two different URLs will end up with the same checksum (but still there is a possibility).
I don't think I was specific enough last time. Here we go:
I have a hex string:
742713478fb3c36e014d004100440041004
e0041004e00000060f347d15798c9010060
6b899c5a98c9014d007900470072006f007
500700000002f0000001f7691944b9a3306
295fb5f1f57ca52090d35b50060606060606
The last 20 bytes should (theoretically) contain a SHA1 Hash of the first part (complete string - 20 bytes). But it doesn't match for me.
Trying to do this with PHP, but no luck. Can you get a match?
Ticket:
742713478fb3c36e014d004100
440041004e0041004e00000060
f347d15798c90100606b899c5a
98c9014d007900470072006f00
7500700000002f0000001f7691944b9a
sha1 hash of ticket appended to original:
3306295fb5f1f57ca52090d35b50060606060606
My sha1 hash of ticket:
b6ecd613698ac3533b5f853bf22f6eb4afb94239
Here's what is in the ticket and how it's being stored. FWIW, I can pull out username, etc, and spot the various delimiters.
http://www.codeproject.com/KB/aspnet/Forms_Auth_Internals/AuthTicket2.JPG
Edited: I have discovered that the string is padded at the end by the decryption function it goes through before this point. I removed the last 6 bytes and adjusted my ticket and hash accordingly. It still doesn't work, but I'm closer.
Your hash is being calculated over the hex string itself. Maybe the appended hash was calculated over another representation of the same data?
I think you are getting confused about bytes vs characters.
Internally, PHP stores every character in a string as a byte. The sha1 hash that PHP generates is a 40-character (40-byte) hexadecimal representation of the 20 bytes of binary data, since each byte needs to be represented by 2 hex characters.
I'm not sure if this is the actual source of your discrepancy, but seeing this misunderstanding makes me wonder if it's related.
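To see the difference concretely (a sketch; $ticketHex is a short placeholder for the full ticket value above):

$ticketHex = '742713478fb3'; // placeholder; use the full ticket hex here
// Hashing the hex characters themselves:
$hashOfText = sha1($ticketHex);
// Hashing the raw bytes that the hex string represents:
$hashOfBytes = sha1(hex2bin($ticketHex));
// sha1($data, true) returns 20 raw bytes instead of 40 hex characters.
$rawDigest = sha1(hex2bin($ticketHex), true);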
Try trimming the string first; it's surprisingly easy to have a newline or space on the end that changes the hash completely.
According to this Online SHA1 tool the hash of the given text (after removing new lines and spaces) is
b6ecd613698ac3533b5f853bf22f6eb4afb94239
Idea: Make sure you are inputting the characters themselves, not a hex number, to the PHP version.
The problem was that the original was a keyed hash. I had to use hash_hmac() with a validation key rather than sha1() without.
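In PHP that looks roughly like this (the key and ticket values below are placeholders for the real validation key and ticket bytes):

$validationKey = hex2bin('0123456789abcdef0123456789abcdef'); // placeholder key
$ticketBytes   = hex2bin('742713478fb3');                     // placeholder ticket data
// Keyed SHA-1: only someone holding the validation key can reproduce this MAC.
$mac = hash_hmac('sha1', $ticketBytes, $validationKey);
echo $mac; // compare against the 20 bytes (40 hex chars) appended to the ticket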