When referring to the length of a hash value such as sha1 or md5 in PHP, is it correct to interpret that as the size of the hash in memory rather than the number of characters present in the literal?
Yes, it is. However, that size is tightly related to the number of characters in the string: if you get a raw string, you'll get 1 character per 8 bits; if you get hex digits (the default), you'll get 1 character per 4 bits.
It's the minimum number of bits required to store the hash unambiguously.
>>> import hashlib
>>> len(hashlib.md5(b'foo').digest()) * 8
128
>>> len(hashlib.sha1(b'foo').digest()) * 8
160
>>> len(hashlib.sha512(b'foo').digest()) * 8
512
The principal output of a secure hash function is always defined in bits. So when referring to the output of a hash function a cryptographer always talks about e.g. 128 bits for the broken MD5 algorithm, 160 bits for SHA1 and obviously 256 bits for SHA-256.
Most crypto APIs, however, only work with bytes. So if an API has a method that reports the hash size, it more often than not returns the size in bytes. That would be 16, 20 and 32 bytes for the above algorithms.
Of course, if the bytes are returned as e.g. hexadecimal digits, the length of the string in characters is double that: 32, 40 or 64 characters. Whether that translates to an identical number of bytes depends on the character encoding (e.g. using UTF-16 would double the number of bytes).
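As a rough illustration of the bits / bytes / hex characters / encoded bytes distinction, here is a minimal Python sketch (Python rather than PHP, to match the snippet earlier in this thread). The helper name describe() is made up for this example.

```python
import hashlib

def describe(name, data=b"foo"):
    """Return (bits, raw bytes, hex characters, bytes of the hex string in UTF-16)."""
    h = hashlib.new(name, data)
    raw = h.digest()            # raw binary digest
    hex_str = h.hexdigest()     # two hex characters per byte
    utf16 = hex_str.encode("utf-16-le")  # two bytes per character here
    return (h.digest_size * 8, len(raw), len(hex_str), len(utf16))

for name in ("md5", "sha1", "sha256"):
    print(name, describe(name))
```

The four numbers differ only because the same digest is being measured in different units or stored in a different encoding.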
Hash functions do have a large internal state, so the memory used by a running implementation is considerably larger than the size of the output. It is not large enough for you to notice on a modern PC, though.
Related
Hi, I am working on an assembler, so I need to take a number and convert it to hex for a branch command. Is there a way to change the number of hex digits returned in the output? We are using 24-bit instructions (6 hex digits); our branch commands use the first digit for the opcode and the second digit for the conditional bits, which leaves me 4 digits for the number. If I have a negative number like -2 I get fffffffffffffffe, which is 16 digits. Is there an easy way to limit the output of dechex() to a specified number of digits? I know how to do positive numbers, as they output the minimum number of digits needed, so 2 becomes 2 and 15 becomes f.
If I go from integer to binary using decbin(), I still get the full 64-bit width. I cannot just cut off the leading digits, can I?
Since I don't care about the possibility of overflow, and I will not get anywhere close to the 65k needed to require more than 4 digits, I can ignore all digits beyond the 4 I need. I would still like to know if there is a way, though.
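One common approach, sketched here in Python under the assumption that the field is a two's-complement value of a fixed number of hex digits, is to mask the integer to the field width before formatting it. The function name to_fixed_hex() is hypothetical.

```python
def to_fixed_hex(value, digits):
    """Format value as exactly `digits` hex digits, two's complement.

    Masking keeps only the low 4*digits bits, so negative numbers wrap
    to their two's-complement form; values that do not fit are silently
    truncated (no overflow check).
    """
    mask = (1 << (4 * digits)) - 1
    return format(value & mask, "0{}x".format(digits))

print(to_fixed_hex(-2, 4))   # fffe
print(to_fixed_hex(2, 4))    # 0002
print(to_fixed_hex(15, 4))   # 000f
```

The same mask-then-format idea should carry over to other languages that have a bitwise AND and a zero-padded hex formatter.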
The docs (http://php.net/manual/de/function.crypt.php) for the crypt() function show the following example for an MD5 hash:
$1$rasmusle$rISCgZzpwk3UhDidwXvin0
I understand that "$1$" is the prefix, which indicates that the hash is an MD5 hash.
But how is the rest of the string an MD5 hash? Normally it should be a 32 char string (0-9, a-f), right?
I'm sure it's a stupid question, but I still want to ask.
Normally it should be a 32 char string (0-9, a-f), right?
That is not correct (at least strictly speaking). Technically, an MD5 hash is a 128-bit numeric value. The form that you are used to is simply a hexadecimal representation of that number. It is often chosen because hex strings are easy to exchange, whereas 128-bit integers are difficult to handle (after all, a typical integer variable usually only holds 64 bits). Consider the following examples:
md5("test") in hexadecimal (base 16) representation: 098f6bcd4621d373cade4e832627b4f6
md5("test") in base 64 representation: CY9rzUYh03PK3k6DJie09g==
md5("test") in decimal (base 10) representation: 12707736894140473154801792860916528374
md5("test") in base 27 representation (never used, just because I can and to prove my point): ko21h9o9h8bc1hgmao4e69bn6f
All these strings represent the same numerical value, just in different bases.
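The point that these are all the same number can be checked with a short Python sketch. It only covers the hexadecimal, base 64 and decimal forms listed above; the variable names are illustrative.

```python
import base64
import hashlib

digest = hashlib.md5(b"test").digest()   # 16 raw bytes = 128 bits
as_int = int.from_bytes(digest, "big")   # the hash as one big integer

hex_form = digest.hex()                                # base 16
b64_form = base64.b64encode(digest).decode("ascii")    # base 64
dec_form = str(as_int)                                 # base 10

print(hex_form)
print(b64_form)
print(dec_form)
```

Converting any of the three strings back gives the same 128-bit value; only the alphabet and the length change.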
I have a very large integer, 12-14 digits long, and I want to encrypt/compress it to an alphanumeric value so that the integer can be recovered later from the alphanumeric value. I tried converting the integer to base 62 and mapping the digits to a-zA-Z0-9, but the resulting value is 7 characters long. That is still too long; I want to get down to about 4-5 characters.
Is there a general way to do this, or some method by which this can be done so that recovering the integer would still be possible? I am asking about the mathematical aspects here, but I will be programming this in PHP, which I only recently started using.
Edit:
I was thinking in terms of assigning a masking bit and using it to generate fewer characters. I am aware that the range is not enough, which is why I was focusing on a mathematical trick or a way of representation. Base 62 was an idea that I already tried, but it is not working out.
14 digit decimal numbers can express 100,000,000,000,000 values (10^14).
5 characters of a 62 character alphabet can express 916,132,832 values (62^5).
You cannot cram the equivalent number of values of a 14 digit number into a 5 character base 62 string. It's simply not possible to express each possible value uniquely. See http://en.wikipedia.org/wiki/Pigeonhole_principle. Even base 64 with 7 characters is not enough (only 4,398,046,511,104 possible values). In fact, if you target a 5 character short string you'd need to compensate by using a base 631 alphabet (631^5 = 100,033,806,792,151).
Even compression doesn't help you. It would mean that two or more numbers would need to compress to the same compressed string (because there aren't enough possible unique compressed values), which logically means it's impossible to uncompress them into two different values.
To illustrate this very simply: Say my alphabet and target "string length" consist of one bit. That one bit can be 0 or 1. It can express 2 unique possible values. Say I have a compression algorithm which compresses anything and everything into this one bit. ... How could I possibly uncompress 100,000,000,000,000 unique values out of that one bit with two possible values? If you could solve that problem, bandwidth and storage concerns would immediately evaporate and you'd be a billionaire.
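The counting argument can be checked with plain integer arithmetic. This is a small Python sketch; the helper names min_length() and min_base() are made up for the example.

```python
def min_length(n_values, base):
    """Shortest string length L such that base**L >= n_values."""
    length, capacity = 0, 1
    while capacity < n_values:
        capacity *= base
        length += 1
    return length

def min_base(n_values, length):
    """Smallest alphabet size B such that B**length >= n_values."""
    base = 2
    while base ** length < n_values:
        base += 1
    return base

n = 10 ** 14                      # all 14-digit decimal values
print(62 ** 5)                    # values expressible in 5 base 62 characters
print(min_length(n, 62))          # characters needed in base 62
print(min_base(n, 5))             # alphabet size needed for 5 characters
```

Any scheme that maps every 14-digit number to a unique string has to respect these bounds, whatever the encoding trick.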
With the 95 printable ASCII characters (the space plus the 94 characters below) you can switch to base 95 encoding instead of 62:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
That way, a decimal integer string of length X can be compressed into a base 95 string of length Y, where
Y = X * log 10 / log 95 = roughly X / 2
which is pretty good compression. So from length 12 you get down to 6 or 7 (95^6 is about 7.35 * 10^11, so the largest 12-digit values need a seventh character). If the purpose of the compression is to save bandwidth when sending JSON, then base 92 can be a good choice (excluding ", \ and /, that become escaped in JSON).
Surely you can get better compression but the price to pay is a larger alphabet. Just replace 95 in the above formula by the number of symbols.
Unless of course, you know the structure of your integers. For instance, if they have plenty of zeroes, you can base your compression on this knowledge to get much better results.
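For what it's worth, a base 95 encoder and decoder could look roughly like the Python sketch below. It assumes non-negative integers and uses the printable ASCII range (space through ~) as the alphabet; the function names are illustrative.

```python
# Printable ASCII, code points 32 (space) through 126 (~): 95 symbols.
ALPHABET = "".join(chr(c) for c in range(32, 127))
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
BASE = len(ALPHABET)

def encode95(n):
    """Encode a non-negative integer as a base 95 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, BASE)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def decode95(s):
    """Inverse of encode95."""
    n = 0
    for ch in s:
        n = n * BASE + INDEX[ch]
    return n

for value in (10 ** 11, 999_999_999_999, 10 ** 14 - 1):
    text = encode95(value)
    print(len(str(value)), "digits ->", len(text), "characters")
```

Note that the output can contain spaces and quote characters, so it needs care when embedded in other formats.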
Because of the pigeonhole principle, you will end up with some values that get compressed and other values that get expanded. It is simply impossible to create a compression algorithm that compresses every possible input string (i.e. in your case, your numbers).
If you force the cardinality of the output set to be smaller than the cardinality of the input set, you'll get collisions (i.e. several input strings get "compressed" to the same compressed binary string). A compression algorithm should be reversible, right? :)
Is there anything that can make the returned length of the PHP CRC32 function to vary?
Thanks!
No, by definition a CRC32 has 32 bits.
You can only vary its representation. For instance, while it can be stored in four 8-bit bytes (and hence fits in a PHP int), you may wish to represent that number in base 10 as a string, in which case it can have up to 10 characters (unsigned), since 2^32-1 is 4294967295.
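To make the fixed size versus varying representation point concrete, here is a short Python sketch using zlib.crc32 (standing in for PHP's crc32() here). The input b"123456789" is the conventional CRC test string.

```python
import zlib

crc = zlib.crc32(b"123456789")   # always an unsigned value below 2**32

raw = crc.to_bytes(4, "big")     # four 8-bit bytes
hex_form = format(crc, "08x")    # 8 hex characters, zero-padded
dec_form = str(crc)              # between 1 and 10 decimal characters

print(len(raw), hex_form, dec_form)
```

Only the unpadded decimal form varies in length, and that is a property of the notation rather than of the checksum.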
Is there any chance that a SHA-1 hash can be purely numeric, or does the algorithm ensure that there must be at least one alphabetical character?
Edit: I'm representing it in base 16, as a string returned by PHP's sha1() function.
Technically, a SHA1 hash is a number; it is just most often encoded in base 16 (which is what PHP's sha1() does), so it nearly always has a letter in it. There is no guarantee of this, though.
The odds of a hex encoded 160 bit number having no digits A-F are (10/16)^40, or about 6.84227766 × 10^-9.
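That figure can be reproduced directly, assuming each of the 40 hex digits is independently and uniformly distributed (a reasonable model for a hash, but still an assumption). The helper has_no_letters() is hypothetical.

```python
import hashlib

# Each of the 40 hex digits must land on one of the 10 symbols 0-9.
p_no_letters = (10 / 16) ** 40
print(p_no_letters)

def has_no_letters(hex_digest):
    """True if the hex string contains only the digits 0-9."""
    return hex_digest.isdigit()

sample = hashlib.sha1(b"foo").hexdigest()
print(len(sample), has_no_letters(sample))
```

At roughly 7 in a billion, an all-digit SHA-1 is rare but perfectly possible, so code should not rely on a letter being present.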
The SHA-1 hash is a 160-bit number. For ease of writing, it is normally written in hexadecimal. Hexadecimal (base 16) digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e and f. There is nothing special about the letters. Each hexadecimal character is equivalent to 4 bits, which means the hash can be written in 40 characters.
I don't believe there is any reason why a SHA-1 hash couldn't consist entirely of digits, but it is improbable. It's roughly like generating a 40 digit (base 10) random number and not getting any 7s, 8s or 9s.
You can represent the output of SHA1 (just like any binary data) in any base you want. Specifically, you can encode the result in base 8 or base 10.