I'm using MurmurHash3 to create unique hashes for text entries. When text entries are created, I use this PHP implementation, which returns a 32-bit integer hash, to get the hash value. The hash is stored in a BINARY(16) database column. I also need to update our existing database, so I'm using this MySQL implementation to do that. To match the PHP-created hash, I convert the result to base 32 and lower-case it.
UPDATE column SET hash=LOWER(CONV(murmur_hash_v3(CONCAT(column1, column2), 0), 10, 32));
It matches the PHP version about 80% of the time, which obviously isn't going to cut it. For example, hashing the string 'engtest' creates 15d15m in PHP and 3uqiuqa in MySQL. However, the string 'engtest sentence' creates the same hash in both. What could I be doing wrong?
Figured it out. PHP's integer type is signed, and MurmurHash was occasionally producing negative hash values that didn't match the always-positive MySQL values. The solution was to format PHP's hash value using sprintf with the format set to "%u" before the base conversion.
$hash = murmurhash3_int($text);
return base_convert(sprintf("%u", $hash), 10, 32);
See the php crc32 docs for more info.
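The mismatch can be reproduced with plain arithmetic. The sketch below is in Python rather than PHP, purely for illustration. The signed value -39224502 is not stated in the question; it is inferred from the two outputs quoted there, on the assumption that the base conversion dropped the minus sign. The helper names are hypothetical.

```python
# Digit alphabet for base 32 in the 0-9, a-v style that the question's outputs use.
DIGITS = "0123456789abcdefghijklmnopqrstuv"


def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (what "%u" achieves)."""
    return value & 0xFFFFFFFF


def to_base32(value: int) -> str:
    """Convert a non-negative integer to a lower-case base-32 string."""
    if value < 0:
        raise ValueError("expected a non-negative integer")
    if value == 0:
        return "0"
    out = []
    while value:
        value, remainder = divmod(value, 32)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


# Assumed signed hash for 'engtest', inferred from the outputs in the question.
signed_hash = -39224502

print(to_base32(abs(signed_hash)))            # 15d15m  (sign lost)
print(to_base32(to_unsigned32(signed_hash)))  # 3uqiuqa (matches MySQL)
```

Hashes that happen to be non-negative as signed 32-bit values are unaffected, which fits the observation that only some rows mismatched.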
Related
I'm building a MySQL database with a table that will store lots of rows (say, like 1.000.000).
Each row will have a numeric ID but I don't want to make it incremental, instead it has to be generated from a unique string.
For example, a user ABC will create a new element at time 123, so the original string will be "ABC-123". A PHP function will "translate" it to a number.
This way, I'll have the possibility to re-generate the same ID from the same pair of data in future. More or less... see it as a Java hashCode() function.
I've found this function that "translates" a string into a number:
function hashCode($string) {
return base_convert(substr(md5($string), 0, 16), 16, 10);
}
I have some doubts about it. First, it starts by creating an md5 hash, which is 32 characters long, then cuts it to 16. That visibly throws data away, so how can the result be a unique hash?
Second, the resulting 16-digit number is converted from base 16 to base 10, so the max value is 18446744073709552046. The MySQL column that will store this number has an UNSIGNED BIGINT datatype, so the maximum value is 18446744073709551615. It's not enough since
18446744073709551615 - 18446744073709552046 = -431
Am I missing something, or is there a better way to do what I need?
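As a cross-check on the range concern, here is a minimal Python sketch of the same idea (first 16 hex digits of MD5 read as an integer). It uses exact integer arithmetic, so it only illustrates the math, not PHP's behaviour. The PHP manual warns that base_convert may lose precision on large numbers because it works with floats internally, which is a plausible (but here unverified) source of the figure ending in 552046. The function name hash_code is hypothetical.

```python
import hashlib

BIGINT_UNSIGNED_MAX = 2**64 - 1  # 18446744073709551615


def hash_code(text: str) -> int:
    """Read the first 16 hex digits (64 bits) of the MD5 digest as an integer."""
    hex16 = hashlib.md5(text.encode("utf-8")).hexdigest()[:16]
    return int(hex16, 16)


# The largest 16-digit hex number is exactly the BIGINT UNSIGNED maximum.
largest = int("f" * 16, 16)
print(largest == BIGINT_UNSIGNED_MAX)   # True

# A double cannot represent 2**64 - 1; it rounds up to 2**64.
print(float(largest) == float(2**64))   # True

print(0 <= hash_code("ABC-123") <= BIGINT_UNSIGNED_MAX)  # True
```

With exact arithmetic, every 64-bit value fits the column. Truncating to 64 bits still means distinct strings can collide, so the result is repeatable but not guaranteed unique.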
Back when I used MD5 I used to create a varchar(32) column in the database.
However, I started using crypt(), and as I understand the output length is variable.
So which length should I set to the varchar?
The maximum number of characters returned is 123 characters. http://php.net/manual/en/function.crypt.php
For those wondering, like I did, what the maximum length of the returned hash can be for the purpose of storing it in a database, the answer is: 123 characters.
This is the top User Contributed Note from the PHP: crypt Manual page
I am using PHP and MySQL to create a string from a value and later compare it to a MD5 hash of the same value.
For instance, in MySQL i have a string value: somerandomvalue
In PHP I get that string value and transfer it to a local variable to hold the string value: $prdAlias
I transform the string value to a MD5 hash value:
$prdAlias = md5($prdAlias);
Then I take only the first 6 characters of that value for use later:
$prdAlias = mb_substr($prdAlias, 0, 6);
LATER
I have the first 6 characters of the MD5 value, I call it: $prdAlias
Now in MySQL I want to compare $prdAlias to the value that I started off with: somerandomvalue. To do that, I must convert the value in the database to an MD5 hash, take only the first 6 characters of the hash, and compare that to $prdAlias.
So I have a prepared statement:
if ($stmt = $link->prepare("
SELECT alias
FROM `products`
WHERE alias = ?
"))
{
... ETC
}
My question now is: within this statement, how could I convert the alias value to MD5 and take only the first 6 characters of that to use in the WHERE clause?
Any assistance would be greatly appreciated, thank you!
EDIT: I am currently running a while loop and checking for the value by processing each row until a match is found... This is not ideal with thousands of rows.
You could use the mysql MD5() function to do this on the database server:
WHERE LEFT(MD5(alias), 6) = ?
but that would still require a full table scan, so it would basically be identical to your while loop. If you want this to be fast, you need an index. I don't think MySQL has computed indexes, which means you will have to add a column for the first six characters of the md5 of alias and compare against that.
I would personally store the whole hash and do a LIKE "123456%" though. Just a gut feeling that might be smarter in the long run. On the other hand if you only store the first six characters, you could add a unique key on that column and detect collisions early on.
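The extra-column approach can be sketched as follows. This uses Python with an in-memory SQLite database only so that the example is self-contained; the column name alias_md5_6, the index name, and the sample rows are hypothetical, and the same schema idea would carry over to MySQL.

```python
import hashlib
import sqlite3


def alias_prefix(alias: str) -> str:
    """First six hex characters of the MD5 of the alias."""
    return hashlib.md5(alias.encode("utf-8")).hexdigest()[:6]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (alias TEXT, alias_md5_6 TEXT)")
# The index lets the lookup avoid scanning every row.
conn.execute("CREATE INDEX idx_alias_md5_6 ON products (alias_md5_6)")

for alias in ("somerandomvalue", "anothervalue", "thirdvalue"):
    conn.execute(
        "INSERT INTO products (alias, alias_md5_6) VALUES (?, ?)",
        (alias, alias_prefix(alias)),
    )


def find_by_prefix(prefix: str) -> list:
    """Return every alias whose stored prefix matches (collisions possible)."""
    rows = conn.execute(
        "SELECT alias FROM products WHERE alias_md5_6 = ?", (prefix,)
    ).fetchall()
    return [row[0] for row in rows]


print(find_by_prefix(alias_prefix("somerandomvalue")))
```

Because six hex characters carry only 24 bits, the lookup returns a list: more than one alias can share a prefix unless a unique key rejects the second one at insert time.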
I'm trying to set up a login system, but I can't solve one problem:
PHP is giving me a different output with md5() than MySQL...
For example, in PHP:
$password = md5("brickmasterj");
return $password;
Returns:
3aa7b18f304e2e2a088cfd197351cfa8
But the MySQL equivalent gives me a shorter version:
3aa7b18f304e2e2a08
What's the problem? And how do I work with this while checking passwords?
I guess the problem is the length of the column in your table. Set the length of the password field to at least 32.
No way MySQL returns it of a length of < 32. If you would do a simple query like SELECT md5('brickmasterj'), you would see. Now you are most likely inserting the value into a column which is not wide enough.
Is your database field 32 characters long? Are you writing to the database using mysql's md5?
The hash size is always fixed. In your case the hash size is 128 bits. When converted to an ASCII string it is a 32-character string that contains only hexadecimal digits.
So if you are storing it in a variable-length character column, the length should be at least 32.
Example: password varchar(32)
This should go in the MySQL table; then you can query it from PHP using
select password from table where password =md5($password);
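A quick way to see the effect the answers describe, sketched in Python for illustration. The width of 18 is an assumption taken from the length of the shortened value quoted in the question; the real column definition is not shown there.

```python
import hashlib

# MD5 always yields 128 bits, i.e. 32 hexadecimal characters.
full_hash = hashlib.md5(b"brickmasterj").hexdigest()
print(len(full_hash))  # 32

# Assumed column width, matching the 18-character value in the question.
COLUMN_WIDTH = 18
stored = full_hash[:COLUMN_WIDTH]  # what a too-narrow column would keep

print(stored == full_hash)           # False: the comparison fails
print(full_hash.startswith(stored))  # True: it is a truncated copy
```

Widening the column to 32 characters (and re-saving the hashes) makes the stored value equal to the full hash again.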
I'm storing unique user-agents in a MySQL MyISAM table so when I have to look if it exists in the table, I check the md5 hash that is stored next to the TEXT field.
User-Agents
{
id - INT
user-agent - TEXT
hash - VARCHAR(32) // md5
}
Is there any way to do the same but using a 32-bit integer and not a text hash? Maybe the md5 in raw format will be faster? That would require a binary search.
[EDIT]
Doesn't MySQL handle hash searches for complete case-sensitive strings?
Store the UNHEX(MD5($value)) in a BINARY(16).
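To illustrate why 16 bytes is enough, the Python sketch below compares the raw digest with its hex form. bytes.fromhex plays the role that UNHEX plays in the SQL above; the sample user-agent string is made up.

```python
import hashlib

user_agent = "Mozilla/5.0 (example)"  # hypothetical sample value

hex_hash = hashlib.md5(user_agent.encode("utf-8")).hexdigest()
raw_hash = hashlib.md5(user_agent.encode("utf-8")).digest()

print(len(hex_hash))  # 32 characters of hex text
print(len(raw_hash))  # 16 raw bytes, fits BINARY(16)

# Decoding the hex string gives exactly the raw digest.
print(bytes.fromhex(hex_hash) == raw_hash)  # True
```

The raw form halves the storage and index size compared with the 32-character hex string while carrying the same 128 bits.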
You could do this instead:
User-Agents
{
id - INT
user-agent - TEXT
hash - UNSIGNED INT (CRC32, indexed)
}
$crc32 = sprintf("%u", crc32($user_agent));
SELECT * FROM user_agents WHERE hash=$crc32 AND user_agent='$user_agent';
It's unlikely that you'll get collisions with crc32 for this kind of data.
To guarantee that collisions will not cause problems, add a secondary search parameter. MySQL will be able to use the index to quickly find the record. Then it can do a simple string search to guarantee that match is correct.
PS: The sprintf() is there to work around signed 32-bit integers. Should be unnecessary on 64-bit systems.
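The lookup pattern described above (integer hash to narrow the search, full string to confirm the match) might look like this. It is a sketch in Python with an in-memory SQLite table so it can run standalone; the function names and index name are hypothetical, and parameter binding is used in place of string interpolation.

```python
import sqlite3
import zlib


def ua_crc32(user_agent: str) -> int:
    """CRC32 as an unsigned 32-bit integer."""
    return zlib.crc32(user_agent.encode("utf-8")) & 0xFFFFFFFF


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_agents (id INTEGER PRIMARY KEY, user_agent TEXT, hash INTEGER)"
)
conn.execute("CREATE INDEX idx_user_agents_hash ON user_agents (hash)")


def find_or_insert(user_agent: str) -> int:
    """Return the row id for a user agent, inserting it if it is new."""
    crc = ua_crc32(user_agent)
    # The indexed hash narrows the candidates; the string comparison
    # rules out the rare CRC32 collision.
    row = conn.execute(
        "SELECT id FROM user_agents WHERE hash = ? AND user_agent = ?",
        (crc, user_agent),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO user_agents (user_agent, hash) VALUES (?, ?)",
        (user_agent, crc),
    )
    return cursor.lastrowid


first_id = find_or_insert("Mozilla/5.0 (example)")
print(find_or_insert("Mozilla/5.0 (example)") == first_id)  # True
```

Keeping the string comparison in the WHERE clause means a collision costs only an extra row comparison, never a wrong match.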
Let MySQL do the hard work for you. Use a CHAR column and create an index on that column. You could convert and store the hash as an integer, but there's absolutely no benefit, and it may actually cause problems.
Try MurmurHash. It's a fast hashing algorithm that's been ported to multiple languages. It takes your input and turns it into a 32- or 64-bit integer hash.
You can't store an MD5 hash in a 32-bit int: it simply won't fit. (It's 32 characters when written in hex, but it's 128 bits of data.)
You could look at MySQL's BINARY and VARBINARY types. See http://dev.mysql.com/doc/refman/5.1/en/binary-varbinary.html. These types store binary data. In your case, BINARY(16) or VARBINARY(16), but since MD5 hashes are always 16 bytes, the latter seems a bit pointless.
You can store MD5 hash in char(32) which is a bit faster than varchar(32).
It's also possible to make two BIGINT fields and keep first half of md5 hash in first field and second part in second field.
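For the two-column variant, a minimal Python sketch of splitting the 128-bit digest into two unsigned 64-bit halves and joining them back. The function names are hypothetical, and the columns would need to be BIGINT UNSIGNED, since either half can exceed the signed 64-bit range.

```python
import hashlib
import struct


def md5_halves(text: str) -> tuple:
    """Split the 16-byte MD5 digest into two big-endian unsigned 64-bit ints."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    high, low = struct.unpack(">QQ", digest)
    return high, low


def join_halves(high: int, low: int) -> bytes:
    """Rebuild the original 16-byte digest from the two halves."""
    return ((high << 64) | low).to_bytes(16, "big")


high, low = md5_halves("")
print(hex(high), hex(low))  # 0xd41d8cd98f00b204 0xe9800998ecf8427e
print(join_halves(high, low) == hashlib.md5(b"").digest())  # True
```

A lookup then has to compare both columns, so a composite index over the pair is the natural companion to this layout.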
Are you REALLY sure the hashes are only 32-bit? MD5 is 128-bit. Cropping the hash to first 4 or 8 bytes would greatly increase risk of collisions.
If your hash field is always an MD5 value generated by PHP, then you can safely set it to CHAR(32). This should not affect the response time of your queries, unless you plan to have millions of rows or, even worse, to JOIN other tables on this field. The bottom line is that fixed-width columns are better than variable-width ones, so if you can optimize, do it.
Regarding changing MD5 into int values, see this question; the conclusion to this is that if you really want to change your MD5 into a 128-bit int value, you might as well use a random number instead of an MD5!
Have you tried creating a BINARY(16) field, and storing the result of md5($plaintext, true); in it? That might work, make sure you index that field as well.
Because trying to fit a 128-bit value in 32 bits doesn't make any sense...