Efficiently comparing hashes in a MySQL data set


I have an encrypted database (MySQL) which has certain columns that need to be searched by a (non-public, authorised) user.
I have been reading this article on Database SE about searching encrypted fields in a database.
From reading this, I have concluded that the best way to do this would be to store a hash of the column values and compare it with a hash of the given data.
Example: searching for a phone number.
HTML: the user enters the phone number to search for.
PHP: the phone number is hashed.
MySQL: the hashed_phone column(s) are compared (=) with the hash and matching rows are returned.
PHP: the matching rows are decrypted and then output to the user.
Apart from the small caveat that I can't search for part of a phone number (which would be ideal in concept but is well outside the scope of this specific question), I have run into a problem:
Secure hashing functions (password_hash, etc.) all use a salt as a best practice, and this entirely makes sense. But if I hash the search term with a fresh salt, the result differs from the value stored in the database, so I can't use a comparison operator to find the correct row(s) in the dataset.
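The mismatch described above can be reproduced in a few lines. This is only a minimal sketch: the phone number is a made-up value, and the plain SHA-256 call is there purely to contrast a deterministic digest with a salted one, not as a recommendation.

```php
<?php
// Hypothetical search term; any string behaves the same way.
$phone = '+441234567890';

// password_hash() generates a new random salt on every call,
// so two hashes of the same input are different strings.
$a = password_hash($phone, PASSWORD_DEFAULT);
$b = password_hash($phone, PASSWORD_DEFAULT);
var_dump($a === $b);                    // bool(false)

// Salted hashes can only be checked with password_verify(),
// which needs the stored hash in PHP, not an SQL "=" comparison.
var_dump(password_verify($phone, $a));  // bool(true)

// An unsalted digest is deterministic, so "=" would work,
// but it contains no secret and can be brute-forced for short inputs.
var_dump(hash('sha256', $phone) === hash('sha256', $phone)); // bool(true)
```

This is why a WHERE hashed_phone = ? lookup cannot work with per-value random salts.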
In Summary:
How can I solve this issue and have encrypted data that can be searched to some degree without decrypting each row of the dataset individually?
Can I use a secure hashing algorithm that does not use a salt (as contradictory as that sounds) without leaving a [potential] gaping hole in the security of the data at rest (at least for those columns)?
Is there an alternative methodology I've not thought of?
I would like the MySQL side of things to be as efficient as possible. There are thousands of rows in the dataset, so going through each one to decrypt and check it is deeply inefficient.
How can I do this?

Related

PHP MySQL table search by string - use hashing?

Using PHP, I have a MySQL database with an Actions table, in which a user optionally assigns actions to some pages in their website. Each such assignment results in an action row, containing (among other things) a unique ActionId and the URL of the appropriate page.
Later on, when in a context of a specific page, I want to find out if there is an action assigned to that page, and fetch (SELECT) the appropriate action row. At that time I know the URL of my page, so I can search the Actions table, by this relatively long string. I suspect this is not an optimal way to search in a database.
I assume a better way would be to use some kind of hashing which converts my long URL strings into integers, making sure no two different URLs are converted into the same integer (encryption is not the issue here). Is there such a PHP function? Alternatively, is there a better strategy for this?
Note I have seen this: SQL performance searching for long strings - but it doesn't really seem to come up with a firm solution, apart from mentioning md5 (which hashes into a string, not to integer).

The hashing strategy is a good strategy.
Dealing with the URL strings directly might indeed be a problem, because they can be very long and contain a lot of special characters, which are always problematic for MySQL searches (REGEXP or LIKE).
That is why hashing solves the problem. Even md5, which is no longer a good choice for hashing passwords because it is not secure, is fine for hashing URLs.
This way http://www.stackoverflow.com becomes 4c9cbeb4f23fe03e0c2222f8c4d8c065, which will be pretty much unique (unless you are very, very unlucky).
Once you have your md5_url field set up, you can search with:
SELECT * FROM Actions WHERE md5_url = ?
where the ? is md5($url) of the current URL.
Of course, be sure to set an index on your md5_url field:
ALTER TABLE Actions
ADD md5_url varchar(32),
ADD KEY(md5_url);
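On the PHP side, the lookup could be sketched roughly as follows. The helper names urlHash and findActionByUrl are hypothetical, and the sketch assumes a PDO connection to the database holding the Actions table with the md5_url column defined above.

```php
<?php
// Lookup key for a URL; md5() returns 32 lowercase hex characters,
// which is why the column above is declared as varchar(32).
function urlHash(string $url): string
{
    return md5($url);
}

// Hypothetical helper: fetch the action row for a page via the indexed
// md5_url column. Returns null when no action is assigned to the URL.
function findActionByUrl(PDO $pdo, string $url): ?array
{
    $stmt = $pdo->prepare('SELECT * FROM Actions WHERE md5_url = ?');
    $stmt->execute([urlHash($url)]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    return $row === false ? null : $row;
}
```

The same urlHash() value has to be written to md5_url whenever a row is inserted or its URL changes, otherwise the lookup silently misses.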

If you add an index to the column, the database should take care of efficiency for you, and the length of the URL should make no difference.

Unique token in CakePHP

I need to create a truly unique token when inserting records in CakePHP. The table can contain millions of rows, so I can't just rely on randomly generated strings. I don't want to use microtime() either, because there is a probability, though very small, that two records are submitted at exactly the same moment.
Of course, the best solution would be to use String::uuid(), but according to the CakePHP documentation:
The uuid method is used to generate unique identifiers as per RFC 4122. The uuid is a 128bit string in the format of 485fc381-e790-47a3-9794-1337c0a8fe68.
So, as far as I understand, it does not use Cake's security salt when generating the value. I therefore decided to hash it with the Security component's hash function (or the Auth password function), because I need it to be unique and very, very secure at the same time. But then I found this question, which says that hashing is not a good idea, although it is about PHP's uniqid and md5:
Why is MD5'ing a UUID not a good idea?
Also, I think a string hashed by the Security component is much harder to guess. For example, String::uuid() in a for loop produces output like this:
for ($i = 0; $i < 30; $i++) {
echo String::uuid()."<br>";
}
die;
// outputs
51f3dcda-c4fc-4141-aaaf-1378654d2d93
51f3dcda-d9b0-4c20-8d03-1378654d2d93
51f3dcda-e7c0-4ddf-b808-1378654d2d93
51f3dcda-f508-4482-852d-1378654d2d93
51f3dcda-01ec-4f24-83b1-1378654d2d93
51f3dcda-1060-49d2-adc0-1378654d2d93
51f3dcda-1da8-4cfe-abe4-1378654d2d93
51f3dcda-2af0-42f7-81a0-1378654d2d93
51f3dcda-3838-4879-b2c9-1378654d2d93
51f3dcda-451c-465a-a644-1378654d2d93
51f3dcda-5264-44b0-a883-1378654d2d93
So part of each string is identical, whereas the results of the hash function are completely different:
echo Security::hash('stackoverflow1');
echo "<br>";
echo Security::hash('stackoverflow2');
die;
// outputs
e9a3fcb74b9a03c7a7ab8731053ab9fe5d2fe6bd
b1f95bdbef28db16f8d4f912391c22310ba3c2c2
So, the question is: can I hash the uuid() in Cake after all? Or what is the best way to get a truly unique, hashed (preferably using my security salt) and secure token?
UPDATE
By "secure token" I mean how difficult it is to guess. A UUID is really unique, but in the example above the values share some parts. The hashed results do not.
Thanks!

I don't think you need to worry about the UUIDs overlapping.
To put these numbers into perspective, the annual risk of someone being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10^−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. Or, to put it another way, the probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
Continue to use String::uuid() and rest easy :)

A UUID is unique
I need to create truly unique token when inserting records in CakePHP
That is exactly what a UUID is. It is normally used in distributed systems to prevent collisions (multiple sources inserting data, possibly out of sync, into a datasource).
A UUID is not a security measure
I need it to be unique and very, really very secure at the same time
Not sure in what way hashing a uuid is supposed to enhance security - it won't. Relying on security by obscurity is more or less guaranteed to fail.
If you need random tokens of some form, use a hash function (hashing a UUID is simply hashing a random seed); if you need guaranteed-unique identifiers, use UUIDs. They aren't the same thing, and a UUID is a very poor mechanism for generating random, non-sequential, "un-guessable" (or whatever the purpose is) strings.

Generating a random string suitable for cryptographic purposes was answered well here:
Secure random number generation in PHP
The code sample fills the string $pr_bits with random binary data, so the characters are unprintable. To use it in a URL, you can convert the binary data to printable characters in a couple of ways. None of them enhances security; they only make the value ready for URLs.
convert bytes to hex: bin2hex($pr_bits)
convert bytes to base64: base64_encode($pr_bits)
hash the bytes (because the output is conveniently in hex, not for added security): hash('md5', $pr_bits)
I include the last one because you will see people use hash functions for other reasons, such as guaranteeing that the output is 16 bytes/128 bits for md5. In PHP, people also use it simply to convert a value into hex.
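A minimal sketch of the three conversions, assuming PHP 7 or later so that random_bytes() can stand in for the generator from the linked answer (that substitution is an assumption, not what the linked code does). The URL-safe variant is an extra step added here because plain base64 can contain +, / and =.

```php
<?php
// Assumption: random_bytes() (PHP 7+) replaces the linked answer's generator.
$pr_bits = random_bytes(16);          // 16 bytes = 128 bits of random data

$hex = bin2hex($pr_bits);             // 32 hex characters
$b64 = base64_encode($pr_bits);       // 24 characters, may contain + / =

// URL-safe variant: swap + and / for - and _, then drop the = padding.
$urlSafe = rtrim(strtr($b64, '+/', '-_'), '=');   // 22 characters

$md5 = hash('md5', $pr_bits);         // 32 hex characters, no added security

echo $hex, "\n", $urlSafe, "\n", $md5, "\n";
```

All three strings carry at most the 128 bits of randomness in $pr_bits; the encoding only changes how they are written.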

I have come up with the following solution:
build the token by concatenating the hash of a random string with the current time in microseconds
$timeStr = str_replace("0.", "", microtime());
$timeStr = str_replace(" ", "", $timeStr);
echo Security::hash('random string').'_'.$timeStr;
// 5ffd3b852ccdd448809abb172e19bbb9c01a43a4_796473001379403705
So the first part (the hash) contributes to the unguessability of the token, and the second part is meant to guarantee its uniqueness.
I hope this helps someone.

Wildcard searching of encrypted data in a MySQL database?

I am in the process of building a small web application which will hold around 10 pieces of information for every person inserted. Due to data protection the majority of this information must be encrypted.
Using the CodeIgniter framework and the CodeIgniter encryption class I can encode the information on the application side before storing it in the database. The CodeIgniter encryption class uses PHP's mcrypt function along with the AES_256 cipher.
The problem I have is that I need to allow the users of the application to search the stored information using a wildcard search, possibly also via an API at a later date.
Has anybody come across a solution to a similar problem? I've read about MySQL's AES_ENCRYPT and AES_DECRYPT, but they still require passing a key back and forth in plain text, which I am reluctant to do.
My current conclusion is that if I continue on this route, decrypting the full table every time a search is made is my only option (obviously not good).

Well, you can't search in encrypted text without decrypting it first, that is true.
However, that doesn't mean that there are no ways around this. For example, you could make an inverted index of your data and hash (sha1, md5, crc32, pick one) the keys used for searching. All you have to do then is hash the search terms you're using, look them up in the index and retrieve any record that matches, which will only be a small part of the table instead of the entire thing.
By hashing the data (use a salt!), you avoid storing the data in an unsafe way, while you can still search through the data because you made an index for it. No decryption required until you're actually sure which documents match.
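One way to read this advice is as a keyed hash: for the lookup to work, the "salt" has to be a single secret value shared by the whole column rather than a random value per row. The sketch below uses hash_hmac() for that. The function names, the normalisation rule, the people table and the hashed_phone column are all assumptions for illustration, and in practice the key would be loaded from configuration kept outside the database rather than generated on each run.

```php
<?php
// Strip formatting so "+44 1234 567890" and "+441234567890" index identically.
function normalisePhone(string $phone): string
{
    return (string) preg_replace('/[^0-9+]/', '', $phone);
}

// Keyed, deterministic hash: the same input and key always give
// the same 64-character hex value, so it can be compared with "=".
function searchHash(string $phone, string $key): string
{
    return hash_hmac('sha256', normalisePhone($phone), $key);
}

// Assumption: generated here only to keep the sketch self-contained.
$key = random_bytes(32);

$stored = searchHash('+44 1234 567890', $key); // saved next to the encrypted row
$lookup = searchHash('+441234567890', $key);   // computed from the search term

var_dump(hash_equals($stored, $lookup));       // bool(true)

// Query sketch (hypothetical table and column names), binding $lookup:
// SELECT * FROM people WHERE hashed_phone = ?
```

Because equal inputs produce equal hashes, anyone who can read the column can still see which rows share a value; the secret key only prevents them from testing guesses offline.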

Generating a checksum or hash for field values of a record

I want to have a column containing a hash or checksum as a single value for comparing records. Is it possible to do something like this in pure SQL? Do you see this as practical compared with a programmatic solution in PHP?

Yes, this is possible. MD5(), SHA1(), and SHA2() are all workable hash functions.
You must compute the values of these functions row by row and query by query. For example, every time you insert a row you'll have to insert the hash value, doing something like this:
INSERT INTO x
SET name = ?name,
address = ?address,
hash = MD5(CONCAT(?name, ?address))
You'll need to do something similar on each update. If you get it wrong, which is remarkably easy especially if your table structure changes, your hashes become worse than useless.
By the way, MD5 isn't cryptographically secure for authentication any more. However, it's still an acceptable choice for this kind of hashing in a closed system.
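For comparison, the programmatic route in PHP could look roughly like this. The function names are hypothetical. rowHash() mirrors MD5(CONCAT(name, address)) for the same bytes, and the delimited variant is an assumption added here to show one pitfall of plain concatenation. Note also that MySQL's CONCAT() returns NULL if any argument is NULL, which a PHP version would have to handle separately.

```php
<?php
// Same result as MySQL's MD5(CONCAT(name, address)) for identical bytes.
function rowHash(string $name, string $address): string
{
    return md5($name . $address);
}

// Variant with a separator byte (not part of the SQL above): keeps
// ("ab", "c") and ("a", "bc") from hashing to the same value.
function rowHashDelimited(string ...$fields): string
{
    return md5(implode("\x1f", $fields));
}

var_dump(rowHash('ab', 'c') === rowHash('a', 'bc'));                   // bool(true)
var_dump(rowHashDelimited('ab', 'c') === rowHashDelimited('a', 'bc')); // bool(false)
```

Whichever variant is used, inserts, updates and comparisons all have to use exactly the same field order and separator.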

Dealing with large amount of data in php/mysql

I have something like an auction, and for each deal in that auction I have to generate a random identifier code and assign it to a user.
I came up with something like this for database storage: {1:XKF3325A|ADSTD2351;7:ZARASR23;12:3290OASJX}. Each entry is user id : random code, and a user can have several random codes separated by |.
My question is: what column type should I use to store this in my database? There might be more than 2k-3k codes generated for users.

Perhaps you're asking the wrong question. You simply shouldn't be dealing with identifiers of this size. The answer is to deal with smaller identifiers by using a better identifier generation algorithm. Your table indexes will thank you.
So the question you should be asking is: How do I create short but unique identifiers? The answer is as follows:
You should use a cryptographic hash algorithm (e.g. SHA-1) to generate your identifier, or just use a UUID implementation. PHP also has a built-in function called uniqid, which generates a time-based unique identifier (not an RFC 4122 UUID), so there's really no need to roll your own. Both methods give you an ID that is far shorter than what you're using, and both can "guarantee" uniqueness across a huge sample size (and more effectively than your algorithm). And when I say shorter, I mean somewhere between 16 and 64 characters at most (SHA-1 produces a 40-character hex string).
If you go the SHA-1 route, the approach is to hash some random (but unique-to-the-user) input, like sha1($timestamp . $username . $itemname . $randomseed). You can also make use of uniqid here (see the comments in the PHP documentation for the function) and do sha1(uniqid()). Make sure to read the notes about uniqid() to see the caveats about generating many IDs per second.

So what if they are over 3k? MySQL can deal with huge stuff, don't worry. Just use two tables, like:
Users
- id
- other info..
Identifiers
- random_str
- foreign key to user
Then you can fetch identifiers for an id, fetch the user that an identifier belongs to, etc

First - see Loren Segal's answer for some good ideas.
Second - why would you use anything random to ensure uniqueness? Random values don't guarantee uniqueness, even if they can make a collision very unlikely.
In most every case, relational databases can solve your problem without any tricks. Indexes and composite keys are designed to solve this kind of problem efficiently.
If you need a single value to uniquely identify something, look into the uniqid stuff that Loren mentioned.
