Is there any chance that a SHA-1 hash can be purely numeric, or does the algorithm ensure that there must be at least one alphabetical character?
Edit: I'm representing it in base 16, as a string returned by PHP's sha1() function.
Technically, a SHA-1 hash is a number; it is just most often encoded in base 16 (which is what PHP's sha1() does), so it nearly always has a letter in it. There is no guarantee of this, though.
The odds of a hex-encoded 160-bit number having no digits a-f are (10/16)^40, or about 6.84227766 × 10^-9.
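That figure can be checked with a few lines of Python. This is only a sketch: it assumes every hex digit of the hash is independent and uniformly distributed (the usual idealization), and the helper name is_purely_numeric is made up for illustration.

```python
import hashlib

HEX_LENGTH = 40  # a 160-bit SHA-1 digest is 40 hex characters

# Probability that all 40 hex digits fall in 0-9, assuming each digit is
# independent and uniformly distributed over 0-f (an idealization).
p_all_digits = (10 / 16) ** HEX_LENGTH


def is_purely_numeric(hex_digest: str) -> bool:
    """Return True if a hex digest contains no letters a-f."""
    return hex_digest.isdigit()


print(f"{p_all_digits:.8e}")

# Checking one arbitrary input; almost every input will contain letters.
digest = hashlib.sha1(b"example input").hexdigest()
print(digest, is_purely_numeric(digest))
```

In other words, you would expect roughly one purely numeric digest per 146 million hashes, so it can happen, but you are unlikely to stumble over one by accident.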
The SHA-1 hash is a 160-bit number. For ease of writing, it is normally written in hexadecimal. Hexadecimal (base 16) digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e and f. There is nothing special about the letters. Each hexadecimal character is equivalent to 4 bits, which means the hash can be written in 40 characters.
I don't believe there is any reason why a SHA-1 hash can't be free of letters, but it is improbable. It's like generating a 40-digit (base 10) random number and not getting any 7s, 8s or 9s.
You can represent the output of SHA1 (just like any binary data) in any base you want. Specifically, you can encode the result in base-8/10.
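As a rough illustration of that point, here is a Python sketch that takes one SHA-1 digest and writes it in base 16, base 10 and base 8. The input b"hello" is arbitrary, and the variable names are mine.

```python
import hashlib

digest = hashlib.sha1(b"hello").digest()   # 20 raw bytes = 160 bits
value = int.from_bytes(digest, "big")      # the hash as a single integer

as_hex = format(value, "040x")   # base 16, zero-padded to 40 characters
as_decimal = str(value)          # base 10, digits 0-9 only
as_octal = format(value, "o")    # base 8, digits 0-7 only

print(as_hex)
print(as_decimal)
print(as_octal)
```

The decimal and octal forms are purely numeric by construction; they are just longer than the hex form.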
Related
Currently I have some code that converts numbers into base62 format
This works fine; however, when putting the data into the database, it's possible to end up with two base62 strings that are the same apart from case
e.g. SsUTF1 and Ssutf1
To get around this issue, is base36 a viable alternative to base62? My very limited understanding is that base36 won't produce such clashing strings, but on the flip side, I am assuming the strings for bigger numbers can be longer in base36 than in base62?
If I already have base62 strings in the database, is it possible to end up with duplicates after switching to base36, given that the numbers the strings are derived from are never going to be the same?
Given that the base numbers are different, their base36 representations will always be different. This is also true of base62 if you compare using a case-sensitive or binary comparison.
Also, the base36 representation of a given number can be longer than its base62 representation. Say we have 10 positions; that means we could represent:
base36: 36^10 = 3 656 158 440 062 976 possibilities
base62: 62^10 = 839 299 365 868 340 224 possibilities
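To make the length trade-off concrete, here is a small Python sketch (not the asker's PHP code). The digit order 0-9, a-z, A-Z and the function name encode are assumptions for illustration.

```python
import string

BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode(number: int, alphabet: str) -> str:
    """Encode a non-negative integer using the given digit alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


# How many distinct values fit in 10 positions.
capacity36 = 36 ** 10
capacity62 = 62 ** 10

# The same number needs more characters in base36 than in base62.
n = 10 ** 12
print(len(encode(n, BASE36)), len(encode(n, BASE62)))
```

Because the base36 alphabet contains only one letter case, two different numbers can never produce strings that differ only by case.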
Hope this helps.
The docs (http://php.net/manual/de/function.crypt.php) for the crypt() function show the following example for an MD5 hash:
$1$rasmusle$rISCgZzpwk3UhDidwXvin0
I understand that "$1$" is the prefix, which indicates that the hash is an MD5 hash.
But how is the rest of the string an MD5 hash? Normally it should be a 32 char string (0-9, a-f), right?
I'm sure it's a stupid question, but I still want to ask.
Normally it should be a 32 char string (0-9, a-f), right?
That is not correct (at least strictly speaking). Technically, an MD5 hash is a 128-bit numeric value. The form you are used to is simply a hexadecimal representation of that number. It is often chosen because strings are easy to exchange (128-bit integers are difficult to handle; after all, a typical integer variable usually holds only 64 bits). Consider the following examples:
md5("test") in hexadecimal (base 16) representation: 098f6bcd4621d373cade4e832627b4f6
md5("test") in base 64 representation: CY9rzUYh03PK3k6DJie09g==
md5("test") in decimal (base 10) representation: 12707736894140473154801792860916528374
md5("test") in base 27 representation (never used, just because I can and to prove my point): ko21h9o9h8bc1hgmao4e69bn6f
All these strings represent the same numerical value, just in different bases.
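If you want to reproduce these representations yourself, a Python sketch along these lines should do it. The helper to_base and its digit order (0-9 followed by a-z) are my own assumptions, so the base 27 output only matches the string above if the same digit order was used there.

```python
import base64
import hashlib

digest = hashlib.md5(b"test").digest()   # 16 raw bytes = 128 bits
value = int.from_bytes(digest, "big")    # the same hash as one integer


def to_base(number: int, base: int) -> str:
    """Write a non-negative integer in the given base (2 to 36)."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if number == 0:
        return "0"
    out = []
    while number > 0:
        number, remainder = divmod(number, base)
        out.append(digits[remainder])
    return "".join(reversed(out))


print(digest.hex())                        # base 16
print(base64.b64encode(digest).decode())   # base 64
print(str(value))                          # base 10
print(to_base(value, 27))                  # base 27
```

Note that Base64 encodes the raw bytes in 6-bit groups with padding, so it is a byte encoding rather than a positional numeral in the strict sense, but it carries the same 128 bits.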
I'm looking for a way to convert an alphanumeric string, e.g. "aBcd3f", into a purely numeric representation, and get the shortest possible input string. The valid characters in the input string are a-z, A-Z, 0-9, and the resultant string would be comprised only of digits 0-9.
Since there are 62 valid values for each character in the input string, I can assign values 00-61 to each input character, and convert the 6 input characters into a 12 character numeric string.
But I would like to get something more compact, if possible - e.g. 8-10 digits. Is it possible, and if so, are there any algorithms or functions for doing this in PHP?
Note that this has to be a 2-way function. I also need to be able to go back from the numeric string to the alphanumeric.
I haven't found this question asked on this site. My question is the opposite of this question, as I'm trying to go in the opposite direction.
A decimal digit encodes log2(10) = 3.32 bits of information on average. Alphanumeric data has 62 possible "digits", so each one encodes log2(62) = 5.95 bits of information on average.
This means that converting from alphanumeric to decimal digits only will require approximately 5.95 / 3.32 = 1.79 times more characters in the output than there are in the input. If your output is constrained to 10 characters maximum you can expect it to encode at most 5.58 characters of alphanumeric input, which for practical purposes means just 5. There is no room for maneuvering here; this is cold math.
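The arithmetic above can be reproduced in a few lines of Python (a sketch; the variable names are mine):

```python
import math

bits_per_decimal_digit = math.log2(10)   # about 3.32 bits
bits_per_alnum_char = math.log2(62)      # about 5.95 bits

# Output digits needed per input character.
expansion = bits_per_alnum_char / bits_per_decimal_digit

# How many alphanumeric characters fit into 10 decimal digits.
max_input_chars = 10 / expansion

print(round(expansion, 2), round(max_input_chars, 2), math.floor(max_input_chars))
```

The same conclusion follows from exact integers: 62^5 is below 10^10, while 62^6 is above it.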
The manner of converting from one representation to the other is fairly straightforward, because in essence you are simply converting a number from base 62 to base 10 and back. You can tweak the code from this answer of mine only slightly to achieve the aim.
See it in action.
Note that with the (arbitrary) order of digits I picked the "largest" possible input with 5 characters is "ZZZZZ", which encodes to 9 decimal digits. If you expand the input to 6 characters the largest input would be "ZZZZZZ" which would need 11 decimal digits to encode -- more than the limit we imposed, as predicted.
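Since the linked code is not reproduced here, the following Python sketch shows the same idea of converting between base 62 and base 10. It is not the original PHP code; the digit order 0-9, a-z, A-Z (which makes "Z" the largest digit) and both function names are assumptions. The decoder takes the original length so that leading "0" characters survive the round trip.

```python
import string

# Assumed digit order: 0-9, then a-z, then A-Z, so "Z" has the value 61.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def alnum_to_decimal(text: str) -> str:
    """Interpret text as a base 62 number and return its decimal digits."""
    value = 0
    for ch in text:
        value = value * 62 + ALPHABET.index(ch)  # ValueError on invalid input
    return str(value)


def decimal_to_alnum(digits: str, length: int) -> str:
    """Convert decimal digits back to a base 62 string of the given length."""
    value = int(digits)
    chars = []
    for _ in range(length):
        value, remainder = divmod(value, 62)
        chars.append(ALPHABET[remainder])
    if value:
        raise ValueError("number does not fit in the requested length")
    return "".join(reversed(chars))


print(alnum_to_decimal("ZZZZZ"))
print(decimal_to_alnum(alnum_to_decimal("aBcd3"), 5))
```

If the input length is not fixed, it has to be stored somewhere (for example as an extra leading digit), which costs some of the space you were trying to save.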
Also note that this analysis assumes every possible input string is as likely to occur as any other, i.e. the input is perfectly random. If this is not the case then the actual information content of the input would be lower than the theoretical maximum and consequently you could take advantage of this with some kind of compression scheme.
I am trying to resolve the following problem via PHP. The aim is to generate a unique 6-character string based on an integer seed and containing a predefined range of characters. The second requirement is that the string must appear random (so if code 1 were 100000, it is not acceptable for code 2 to be 100001, and 3 100002)
The range of characters is:
Uppercase A-Z excluding: B, I, O, S and Z
0-9 excluding: 0, 1, 2, 5, 8
So that would be a total of 26 characters if I am not mistaken. My first idea would be to encode from base 10 to base 24, starting at number 7962624. So do 7962624 + seed, and then base24-encode that number.
This gives me the characters 0-N. If I replace the resulting string in the following fashion, I then meet the first criteria:
B=P, I=Q, 0=R, 1=T, 2=U, 5=V, 8=W
So at this point, my codes will look something like this:
1=TRRRR, 2=TRRRT, 3=TRRRU
So my question to you gurus is: How can I make a method that behaves consistently (so the return string for a given integer is always the same) and meets the 2 requirements above? I have spent 2 full days on this now and short of dumping 700,000,000 codes into a database and retrieving them randomly I'm all out of ideas.
Stephen
You get a reasonably random looking sequence if you take your input sequence 1,2,3... and apply a linear map modulo a prime number. The number of unique codes is limited to the prime number so you should choose a large one. The resulting codes will be unique as long as you choose a multiplier that's not divisible by the prime.
Here's an example: With 6 characters you can make 26^6 = 308915776 unique strings, so a suitable prime number could be 308915753. This function will therefore generate over 300,000,000 unique codes:
function encode($num) {
    // Linear map modulo a prime, then write the result in base 26 (digits 0-9, a-p)
    $scrambled = (240049382 * $num + 37043083) % 308915753;
    return base_convert($scrambled, 10, 26);
}
Make sure that you run this on 64bit PHP though, otherwise the multiplication will overflow. On 32bit you'll have to use bcmath. The codes generated for the numbers 1 through 9 are:
n89a2d
hdh4jo
biopb9
5o6k2k
3eek5
k8m9aj
ee4424
8jbojf
2ojjb0
All that's left is filling in the initial 0s that are sometimes missing and replacing the letters and numbers so that none of the forbidden characters are produced.
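Those last two steps could look roughly like this. It is a Python sketch rather than PHP (Python integers don't overflow, so no 64-bit caveat applies), it reuses the constants from the function above, and the ordering of the ALLOWED alphabet is an arbitrary choice of mine. encode_raw is only there to compare against the base 26 codes listed above.

```python
PRIME = 308915753        # modulus from the function above
MULTIPLIER = 240049382
OFFSET = 37043083

# Assumed target alphabet: A-Z without B, I, O, S, Z, plus the digits 3, 4, 6, 7, 9.
ALLOWED = "ACDEFGHJKLMNPQRTUVWXY34679"
# Digits as produced by base_convert(..., 10, 26), used only for comparison.
BASE26_DIGITS = "0123456789abcdefghijklmnop"


def scramble(num: int) -> int:
    """Linear map modulo a prime; a bijection on 0..PRIME-1."""
    return (MULTIPLIER * num + OFFSET) % PRIME


def to_digits(value: int, width: int = 6) -> list:
    """Return exactly `width` base 26 digits, most significant first."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 26)
        digits.append(remainder)
    return digits[::-1]


def encode_raw(num: int) -> str:
    """Zero-padded base 26 code, before replacing forbidden characters."""
    return "".join(BASE26_DIGITS[d] for d in to_digits(scramble(num)))


def encode(num: int) -> str:
    """Six-character code that uses only the allowed characters."""
    return "".join(ALLOWED[d] for d in to_digits(scramble(num)))


for n in range(1, 10):
    print(n, encode_raw(n), encode(n))
```

Mapping each base 26 digit directly to a position in the allowed alphabet avoids a separate search-and-replace pass, and always producing six digits takes care of the missing leading zeros.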
As you can see, there's no obvious pattern, but someone with some time on their hands, enough motivation and access to a few of these codes will be able to find out what's going on. A safer alternative is using an encryption algorithm with a small block size, such as Skip32.
When referring to the length of a hash value such as sha1 or md5 in PHP, is it correct to interpret that as the size of the hash in memory rather than the number of characters present in the literal?
Yes, it is. However, that size is tightly related to the number of characters in the string: if you get a raw string, you get 1 character per 8 bits; if you get hex digits (the default), you get 1 character per 4 bits.
It's the minimum number of bits required to store the hash unambiguously.
>>> import hashlib
>>> len(hashlib.md5(b'foo').digest()) * 8
128
>>> len(hashlib.sha1(b'foo').digest()) * 8
160
>>> len(hashlib.sha512(b'foo').digest()) * 8
512
The principal output of a secure hash function is always defined in bits. So when referring to the output of a hash function a cryptographer always talks about e.g. 128 bits for the broken MD5 algorithm, 160 bits for SHA1 and obviously 256 bits for SHA-256.
Most crypto APIs however only work with bytes. This means that if there is a specific method present to indicate hash size, that more often than not the size in bytes is returned. So that would be 16, 20 and 32 bytes for the above algorithms.
Of course, if the bytes are returned as e.g. hexadecimals, then the length of the string in characters is double that: 32, 40 or 64 characters. Whether that translates to an identical number of bytes depends on the character encoding (e.g. using UTF-16 would double the number of bytes).
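The different ways of counting can be put side by side with Python's hashlib. This is only a sketch; the input b"foo" and the layout of the sizes dictionary are arbitrary choices.

```python
import hashlib

sizes = {}
for name, constructor in (("md5", hashlib.md5),
                          ("sha1", hashlib.sha1),
                          ("sha256", hashlib.sha256)):
    h = constructor(b"foo")
    hex_string = h.hexdigest()
    sizes[name] = (
        h.digest_size * 8,                     # output size in bits
        h.digest_size,                         # output size in bytes
        len(hex_string),                       # characters in the hex string
        len(hex_string.encode("utf-16-le")),   # bytes of that string in UTF-16
    )

for name, row in sizes.items():
    print(name, row)
```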
Hash functions do have a big internal state, so the number of bytes used by a running implementation is much higher than the number of bits in the output would suggest. It is not so high that you would notice it on a modern PC, though.