I'm storing unique user-agents in a MySQL MyISAM table. When I need to check whether one already exists in the table, I look up the md5 hash that is stored next to the TEXT field.
User-Agents
{
id - INT
user-agent - TEXT
hash - VARCHAR(32) // md5
}
Is there any way to do the same using a 32-bit integer instead of a text hash? Maybe storing the md5 in raw format would be faster? That would require a binary search.
[EDIT]
Doesn't MySQL handle hash searches for complete case-sensitive strings?
Store the UNHEX(MD5($value)) in a BINARY(16).
You could do this instead:
User-Agents
{
id - INT
user-agent - TEXT
hash - UNSIGNED INT (CRC32, indexed)
}
$crc32 = sprintf("%u", crc32($user_agent));
SELECT * FROM user_agents WHERE hash=$crc32 AND user_agent='$user_agent';
It's unlikely that you'll get collisions with crc32 for this kind of data.
To guarantee that collisions will not cause problems, add a secondary search parameter. MySQL will be able to use the index to quickly find the candidate records, and then do a simple string comparison to guarantee that the match is correct.
PS: The sprintf() is there to work around signed 32-bit integers. Should be unnecessary on 64-bit systems.
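The two-step lookup described above (narrow by an integer hash, then confirm with a full string comparison) can be sketched as follows. This is only an illustration in Python rather than PHP/MySQL: the class UserAgentIndex is a hypothetical in-memory stand-in for the indexed hash column, not part of any library.

```python
import zlib
from collections import defaultdict


def crc32_unsigned(text: str) -> int:
    """Return the CRC32 of the UTF-8 bytes as an unsigned 32-bit integer."""
    # On Python 3 zlib.crc32 is already unsigned; the mask just makes
    # the "fits an UNSIGNED INT column" intent explicit.
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


class UserAgentIndex:
    """Hypothetical in-memory stand-in for a table with an indexed hash column."""

    def __init__(self) -> None:
        # Maps crc32 value -> list of full strings sharing that hash.
        self._buckets = defaultdict(list)

    def add(self, user_agent: str) -> None:
        bucket = self._buckets[crc32_unsigned(user_agent)]
        if user_agent not in bucket:
            bucket.append(user_agent)

    def contains(self, user_agent: str) -> bool:
        # Step 1: narrow by hash. Step 2: confirm with a full string compare,
        # so a CRC32 collision can never produce a false match.
        candidates = self._buckets.get(crc32_unsigned(user_agent), [])
        return user_agent in candidates
```

The string comparison plays the same role as the AND user_agent='...' condition in the query above.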
Let MySQL do the hard work for you. Use a CHAR column and create an index on that column. You could convert and store the hash as an integer, but there's absolutely no benefit, and it may actually cause problems.
Try MurmurHash. It's a fast hashing algorithm that has been ported to multiple languages. It takes your input and turns it into a 32-bit or 64-bit integer hash.
You can't store an MD5 hash in a 32-bit int: it simply won't fit. (It's 32 characters when written in hex, but it's 128 bits of data.)
You could look at MySQL's BINARY and VARBINARY types. See http://dev.mysql.com/doc/refman/5.1/en/binary-varbinary.html. These types store binary data. In your case, BINARY(16) or VARBINARY(16), but since MD5 hashes are always 16 bytes, the latter seems a bit pointless.
You can store MD5 hash in char(32) which is a bit faster than varchar(32).
It's also possible to make two BIGINT fields and keep first half of md5 hash in first field and second part in second field.
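A minimal sketch of the two-field idea, written in Python for illustration. The function name md5_as_two_uint64 is made up for this example. Note that each half can exceed the signed 64-bit range, so the columns would need to be BIGINT UNSIGNED.

```python
import hashlib
import struct
from typing import Tuple


def md5_as_two_uint64(text: str) -> Tuple[int, int]:
    """Split the 16-byte MD5 digest into two unsigned 64-bit integers."""
    digest = hashlib.md5(text.encode("utf-8")).digest()  # 16 raw bytes
    # ">QQ" = big-endian, two unsigned 64-bit values (high half, low half).
    high, low = struct.unpack(">QQ", digest)
    return high, low
```

Both halves have to be compared on lookup; matching only one of them is equivalent to truncating the hash to 64 bits.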
Are you REALLY sure the hashes are only 32-bit? MD5 is 128-bit. Cropping the hash to first 4 or 8 bytes would greatly increase risk of collisions.
If your hash field is always an MD5 value generated by PHP, then you can safely set it to CHAR(32). This should not noticeably affect the response time of your queries unless you plan to have millions of rows or, even worse, JOIN other tables on this field. The bottom line is that fixed-width columns are better than variable-width ones, so if you can optimize, do it.
Regarding changing MD5 into int values, see this question; the conclusion to this is that if you really want to change your MD5 into a 128-bit int value, you might as well use a random number instead of an MD5!
Have you tried creating a BINARY(16) field and storing the result of md5($plaintext, true); in it? That might work. Make sure you index that field as well.
Because trying to fit a 128-bit value in 32 bits doesn't make any sense...
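To make the sizes concrete, here is a small Python illustration (not PHP) of the difference between the raw digest, which is what md5($plaintext, true) or UNHEX(MD5(...)) would give you, and the usual hex form. The helper names are just for this example.

```python
import hashlib


def md5_raw(text: str) -> bytes:
    """Raw MD5 digest: 16 bytes, suitable for a BINARY(16) column."""
    return hashlib.md5(text.encode("utf-8")).digest()


def md5_hex(text: str) -> str:
    """Hex MD5 digest: 32 characters, suitable for a CHAR(32) column."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

The two forms carry the same 128 bits; the hex form just spends two characters per byte.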
Related
I have a question regarding the uniqueness of the md5 function.
I know that md5 hashes (of a microtime value) are not guaranteed to be unique; however, they are pretty unique :)
How can I calculate the probability of a collision between two portions of md5 hashes?
For example, the following PHP generates an 8-character string from the md5 result:
substr(md5(microtime()), 0, 8);
A second scenario: what if the offset is random (so it takes a different portion of the hash each time)?
substr(md5(microtime()), rand(0, 32), 8);
There are 2^32 combinations of 8 hexadecimal digits. Even if they are completely random, you can only generate on the order of 2^16 = 65,536 such strings (roughly 80,000 on average) before you get two that are the same.
md5(), using a random index or not, doesn't significantly change anything as long as all the microtime() values you use are unique. But if you are generating these too fast, or across many machines, then the situation is much, much worse, because there's a good chance you could end up using the same microtime() value twice.
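For reference, the usual birthday-bound approximations can be evaluated with a few lines of Python. This is a sketch that assumes the values are uniformly random, which is the best case; reused microtime() values only make things worse. The function names are made up for this example.

```python
import math


def draws_for_collision_probability(space_size: int, probability: float = 0.5) -> float:
    """Approximate number of uniform random draws needed before the chance
    of at least one collision reaches `probability` (birthday approximation)."""
    return math.sqrt(2.0 * space_size * math.log(1.0 / (1.0 - probability)))


def expected_draws_until_collision(space_size: int) -> float:
    """Approximate expected number of draws until the first collision."""
    return math.sqrt(math.pi * space_size / 2.0)


if __name__ == "__main__":
    space = 2 ** 32  # 8 hexadecimal digits
    print(round(draws_for_collision_probability(space)))
    print(round(expected_draws_until_collision(space)))
```

For 8 hex digits, both figures come out at a small multiple of 2^16, which is where the "tens of thousands" estimate comes from.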
Since you are asking about the uniqueness of your string, it is really a matter of probability: the larger the character set and the longer the random string, the lower the chance of generating the same string twice.
So, to get a unique string you need to store the strings in your DB and compare each new random string against them; if you find a match, generate a fresh string, and repeat until you get a unique one.
It depends on how many "sub-hashes" you are going to generate and how many bits you're keeping from the original MD5 hash (the length of a "sub-hash"). If you generate just 1 sub-hash and keep just 1 bit, then there is no collision at all. If you generate 2 sub-hashes, expect a 50% chance of collision. Use 2 bits and the odds are 25%. You do the math. Refer to the birthday paradox for more info.
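"Doing the math" exactly is straightforward for any number of sub-hashes and bits. The Python sketch below assumes the kept bits behave like uniformly random values; the function name collision_probability is just for this example.

```python
def collision_probability(num_values: int, bits: int) -> float:
    """Exact probability that at least two of `num_values` uniformly random
    `bits`-bit values are equal."""
    space = 2 ** bits
    if num_values > space:
        return 1.0  # pigeonhole: more values than possible outcomes
    p_all_distinct = 1.0
    for i in range(num_values):
        # The (i+1)-th value must avoid the i values already taken.
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct
```

With 1 bit and 2 sub-hashes this gives 0.5, and with 2 bits and 2 sub-hashes it gives 0.25, matching the figures above.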
I'm using Murmurhash3 to create unique hashes for text entries. When text entries are created, I'm using this php implementation, which returns a 32 bit hash integer, to get the hash value. The hash is stored in a BINARY(16) database column. I also need to update our existing database so I'm using this MySql implementation to update the database. In order to match the php created hash, I'm base converting it and lower-casing it.
UPDATE column SET hash=LOWER(CONV(murmur_hash_v3(CONCAT(column1, column2), 0), 10, 32));
It matches the php version about 80% of the time, which obviously isn't going to cut it. For example, hashing the string 'engtest' creates 15d15m in php and 3uqiuqa in MySql. However, the string 'engtest sentence' creates the same hash in both. What could I be doing wrong?
Figured it out. PHP's integer type is signed, and occasionally Murmurhash was producing negative hash values that didn't match the always-positive MySQL values. The solution was to format PHP's hash value using sprintf with the format set to "%u" before the base conversion.
$hash = murmurhash3_int($text);
return base_convert(sprintf("%u", $hash), 10, 32);
See the php crc32 docs for more info.
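The signed/unsigned mismatch can be reproduced outside PHP. The Python sketch below uses hypothetical helper names and assumes (this is an inference from the two strings in the question, not something the post states) that the PHP side produced -39224502 for 'engtest' and that the minus sign was dropped during base conversion: 15d15m decodes to 39224502, while 3uqiuqa decodes to 4255742794, which is 2^32 - 39224502.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (like sprintf('%u'))."""
    return value & 0xFFFFFFFF


def to_base(number: int, base: int = 32) -> str:
    """Convert a non-negative integer to a lowercase string in the given base."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if number == 0:
        return "0"
    out = []
    while number:
        number, remainder = divmod(number, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))
```

Converting to unsigned first yields the same base-32 string that the MySQL side produced.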
Back when I used MD5 I used to create a varchar(32) column in the database.
However, I started using crypt(), and as I understand the output length is variable.
So which length should I set to the varchar?
The maximum number of characters returned is 123 characters. http://php.net/manual/en/function.crypt.php
For those wondering, like I did, what the maximum length of the
returned hash can be for the purpose of storing it in a database, the
answer is:
123 characters.
This is the top User Contributed Note from the PHP: crypt Manual page
I have a need to store an encrypted but recoverable (by admin) password in MySQL, from PHP. AFAIK, the most straightforward way to do this is with openssl_public_encrypt(), but I'm not sure what column type is needed. Can I make any reliable judgment on the maximum length of encrypted output, based upon the size of the key and the input?
Or am I forced to use a huge field (e.g. BLOB), and just hope it works all the time?
The openssl_public_encrypt function limits the size of the data you can encrypt to the length of the key. If you use padding (recommended), you'll lose an extra 11 bytes.
However, the PKCS#1 standard, which OpenSSL uses, specifies a padding scheme (so you can encrypt smaller quantities without losing security), and that padding scheme takes a minimum of 11 bytes (it will be longer if the value you're encrypting is smaller). So the highest number of bits you can encrypt with a 1024-bit key is 936 bits because of this (unless you disable the padding by adding the OPENSSL_NO_PADDING flag, in which case you can go up to 1023-1024 bits). With a 2048-bit key it's 1960 bits instead.
Of course you should never disable padding, because that will make identical passwords encrypt to the same value.
So for a 1024-bit key the maximum password input length is 117 chars.
For a 2048-bit key it's 245 chars.
I'm not 100% sure of the output length, but a simple trial should confirm this. The output length is a simple function of the key length: RSA ciphertext is as long as the key modulus, so for a 2048-bit key it should be 256 bytes.
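The arithmetic above can be written down as a tiny helper. This Python sketch only does the size calculation for PKCS#1 v1.5 padding (the 11-byte minimum overhead mentioned above); it performs no encryption, and the name rsa_sizes is made up for this example.

```python
from typing import Tuple

PKCS1_V15_MIN_OVERHEAD_BYTES = 11  # minimum padding overhead, per the text above


def rsa_sizes(key_bits: int) -> Tuple[int, int]:
    """Return (max plaintext bytes, ciphertext bytes) for an RSA key of
    `key_bits` bits with PKCS#1 v1.5 padding."""
    key_bytes = (key_bits + 7) // 8
    max_plaintext = key_bytes - PKCS1_V15_MIN_OVERHEAD_BYTES
    ciphertext = key_bytes  # output is always as long as the modulus
    return max_plaintext, ciphertext
```

For a 2048-bit key this gives 245 bytes of input and 256 bytes of output, which is the column width to plan for.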
You should use a binary string with the required length to store the password.
For speed reasons it's best to use a limited length index on the field.
Do not use blob (!) because that will slow things way down for no benefit.
CREATE TABLE user (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  username VARCHAR(50) NOT NULL,
  passRSA BINARY(256),        -- double-check the length
  INDEX ipass (passRSA(10))   -- only indexes the first 10 bytes, for speed reasons
) ENGINE = InnoDB;
Adding extra bytes to the index will just slow things down and grow the index file for no benefit.
I use this function for hashing my passwords:
// RETURNS: rAyZOnlNBxO2WA53z2rAtFlhdS+M7kec9hskSCpeL6j+WwcuUvfFbpFJUtHvv7ji
base64_encode(hash_hmac('sha384', $str . SC_NONCE, SC_SITEKEY, true));
And I store the hashes in a char(64) field (MySQL, InnoDB).
Should I use varchar(64) instead of char(64)? Why?
Edit:
I replaced sha256 with sha384, because in this example sha256 always returns 44 bytes for me. Sorry for the confusion. Now it's 64 bytes.
VARCHAR saves storage by using only as much space as the value requires. If the hash is always 64 characters long, it makes no difference in terms of storage, so CHAR is probably just as good as VARCHAR in this case.
If you have variable length data to store, then a varchar will save wasting unnecessary space.
You should use CHAR(64) since your hash is fixed in length. Using VARCHAR will add another byte, wasting space.
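Even though you are using a Base64-encoded string, the result is not necessarily 64 characters long; the length depends on the size of the input. If the input size can vary, VARCHAR is better because the result can be shorter than 64 characters. (For a fixed 48-byte SHA-384 digest, though, the Base64 output is always exactly 64 characters.)
In fact, as seen here, 64 characters is the maximum length rather than the set length.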
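Whether the encoded value has a fixed length is easy to check: Base64 turns every 3 input bytes into 4 output characters (with padding), so the output length depends only on the input length. The Python sketch below mirrors the PHP expression from the question; the key and message values used with it are placeholders, not the asker's SC_NONCE or SC_SITEKEY.

```python
import base64
import hmac


def base64_length(num_bytes: int) -> int:
    """Length of the padded Base64 encoding of `num_bytes` bytes."""
    return 4 * ((num_bytes + 2) // 3)


def encoded_hmac(message: bytes, key: bytes, algorithm: str) -> str:
    """Base64 of the raw HMAC digest, analogous to
    base64_encode(hash_hmac($algo, $msg, $key, true)) in PHP."""
    digest = hmac.new(key, message, algorithm).digest()
    return base64.b64encode(digest).decode("ascii")
```

A SHA-256 digest is 32 bytes (44 characters encoded) and a SHA-384 digest is 48 bytes (64 characters encoded), regardless of the password being hashed.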