128 bit reversible encryptor/hasher to reduce DB size

Is there anything out there for PHP that can hash/encrypt a long string into a 128 bit string that can also be reversed?
I am trying to import hundreds of millions of strings into a MySQL DB, and the average string is over 100 characters. MD5 gets this down to 32 characters, which significantly reduces storage, but I cannot reverse it again in my application.
Does PHP have anything available that can handle this?

If I understand your question correctly, it seems to me you are mixing up hashing and compression.
Most hash functions are not easily reversible, because that is not their purpose. There are infinitely many "Strings/ByteStreams/Numbers/..." that map to any given result of a hash function. As you may know, even images that are a few gigabytes in size still give you an md5sum of 32 characters.
You cannot just magically map any string onto a shorter string of fixed length and then magically poof it back into the original string.
It may well be that some hash functions could be reversed efficiently if you know that your target results must have certain properties (in your case maybe a character length of 100-120), but I doubt it.
Or do I totally misunderstand, and by "128 bit string" you just mean ASCII strings?

No, you can't do this: Pigeonhole principle
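To make the pigeonhole argument concrete, here is a minimal sketch. The function tinyHash is a hypothetical toy (the lowest byte of crc32), not something anyone would use in practice; it only shows that once there are more inputs than possible outputs, some inputs must share an output, so no reverse mapping can exist.

```php
<?php
// Hypothetical toy hash: keep only the lowest byte of crc32(),
// so there are just 256 possible outputs.
function tinyHash(string $input): int
{
    return crc32($input) & 0xFF;
}

// Feed it 1000 distinct inputs and record which outputs come back.
$outputs = [];
for ($i = 0; $i < 1000; $i++) {
    $outputs[tinyHash("string-$i")] = true;
}

$distinctOutputs = count($outputs);            // can never exceed 256
$sharedOutputs = 1000 - $distinctOutputs;      // so at least 744 inputs collide
```

The same counting applies to MD5: there are only 2^128 possible outputs, but far more possible strings of 100 characters.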

Laravel 4 Encryption: how many characters to expect

I've just had an interesting little problem.
Using Laravel 4, I encrypt some entries before adding them to a db, including email address.
The db was set up with the default varchar length of 255.
I've just had an entry that encrypted to 309 characters, blowing up the encryption by cutting off the last 50-odd characters in the db.
I've (temporarily) fixed this by simply increasing the varchar length to 500, which should - in theory - cover me from this, but I want to be sure.
I'm not sure how the encryption works, but is there a way to tell what maximum character length to expect from the encrypt output for the sake of setting my database?
Should I change my field type from varchar to something else to ensure this doesn't happen again?

Conclusion
First, be warned that there have been quite a few changes between 4.0.0 and 4.2.16 (which seems to be the latest version).
The scheme starts with a staggering overhead of 188 characters for 4.2 and about 244 for 4.0 (given that I did not forget any newlines and such). So to be safe you will probably need on the order of 200 characters for 4.2 and 256 characters for 4.0, plus 1.8 times the plaintext size, if the characters in the plaintext are encoded as single bytes.
Analysis
I just looked into the source code of Laravel 4.0 and Laravel 4.2 with regard to this function. Let's get into the size first:
the data is serialized, so the encryption size depends on the size of the type of the value (which is probably a string);
the serialized data is PKCS#7 padded using Rijndael 256 or AES, so that means adding 1 to 32 bytes or 1 to 16 bytes - depending on the use of 4.0 or 4.2;
this data is encrypted with the key and an IV;
both the ciphertext and IV are separately converted to base64;
a HMAC using SHA-256 over the base64 encoded ciphertext is calculated, returning a lowercase hex string of 64 bytes
then the ciphertext consists of base64_encode(json_encode(compact('iv', 'value', 'mac'))) (where the value is the base 64 ciphertext and mac is the HMAC value, of course).
A string in PHP is serialized as s:<i>:"<s>"; where <i> is the size of the string, and <s> is the string (I'm presuming PHP platform encoding here with regards to the size). Note that I'm not 100% sure that Laravel doesn't use any wrapping around the string value, maybe somebody could clear that up for me.
Calculation
All in all, everything depends quite a lot on character encoding, and it would be rather dangerous for me to make a good estimation. Let's assume a 1:1 relation between byte and character for now (e.g. US-ASCII):
serialization adds up to 9 characters for strings up to 999 characters
padding adds up to 16 or 32 bytes, which we assume are characters too
encryption keeps data the same size
base64 in PHP creates ceil(len / 3) * 4 characters, but let's simplify that to (len * 4) / 3 + 4; the base 64 encoded IV is 44 characters
the full HMAC is 64 characters
the JSON encoding adds 3*5 characters for quotes and colons, 4 characters for the braces and commas around them, and 10 characters for the key names, totaling 29 characters (I'm presuming json_encode does not end with white space here); base 64 again adds the same overhead
OK, so I'm getting a bit tired here, but you can see that the plaintext is expanded by base64 encoding at least twice. In the end it's a scheme that adds quite a lot of overhead; they could just have used base64(IV|ciphertext|mac) to seriously cut down on overhead.
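The steps above can be turned into a rough estimator. This is only a sketch of the calculation, not Laravel code: the function names are hypothetical, it assumes the IV is exactly one cipher block, and it assumes json_encode adds no escape characters.

```php
<?php
// Length of base64_encode() output for $n input bytes.
function base64Length(int $n): int
{
    return intdiv($n + 2, 3) * 4;
}

// Estimate the length of the encrypted payload for a plaintext string.
// $blockSize: 16 for AES (4.2), 32 for Rijndael-256 (4.0).
// Assumption: the IV is one block long and json_encode() escapes nothing.
function estimateEncryptedLength(string $plaintext, int $blockSize): int
{
    $serialized = strlen(serialize($plaintext));
    // PKCS#7 always adds between 1 and $blockSize bytes.
    $padded = (intdiv($serialized, $blockSize) + 1) * $blockSize;
    $value = base64Length($padded);    // base64 of the ciphertext
    $iv = base64Length($blockSize);    // base64 of the IV
    $mac = 64;                         // hex-encoded HMAC-SHA-256
    $json = strlen('{"iv":"","value":"","mac":""}') + $iv + $value + $mac;
    return base64Length($json);        // outer base64 over the JSON
}

$overhead42 = estimateEncryptedLength('', 16); // 188
$overhead40 = estimateEncryptedLength('', 32); // 244
```

Treat the result as a lower bound rather than an exact figure: base64 output can contain '/', which json_encode escapes as '\/' by default, so a real payload may come out a few characters longer.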
Notes
if you're not on 4.2 now, I would seriously consider upgrading to the latest version because 4.2 fixes quite a lot of security issues
the sample code uses a string as key, and it is unclear if it is easy to use bytes instead;
the documentation does warn against key sizes other than the Rijndael defaults, but forgets to mention string encoding issues;
padding is always performed, even if CTR mode is used, which kind of defeats the purpose;
Laravel pads using PKCS#7 padding, but as the serialization always seems to end with ;, that was not really necessary;
it's a nice thing to see authenticated encryption being used for database encryption (the IV wasn't used, fixed in 4.2).

@MaartenBodewes does a very good job of explaining how long the actual string will probably be. However, you can never know it for sure, so here are two options to deal with the situation.
1. Make your field text
Change the field from a limited varchar to a "self-expanding" text. This is probably the simpler one, and especially if you expect rather long input I'd definitely recommend this.
2. Just make your varchar longer
As you did already, make your varchar longer depending on what input length you expect/allow. I'd multiply by a factor of 5.
But don't stop there! Add a check in your code to make sure the data doesn't get truncated:
$encrypted = Crypt::encrypt($input);
// 500 is the varchar length of the DB field; anything longer would be truncated
if (strlen($encrypted) > 500) {
    // do something about it
}
What can you do about it?
You could either write an error to the log and add the encrypted data (so you can manually re-insert it after you extended the length of your DB field)
Log::error('An encrypted value was too long for the DB field xy. Length: '.strlen($encrypted).' Data: '.$encrypted);
Obviously that means you have to check the logs frequently (or have them mailed to you), and also that the user could encounter errors while using the application because of the incorrect data in your DB.
The other way would be to throw an exception (and display an error to the user) and of course also write it to the log so you can fix it...
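A minimal sketch of the exception variant, in plain PHP without any Laravel classes. The helper name assertFitsColumn and the limit of 500 are only illustrative; LengthException is a standard SPL exception.

```php
<?php
// Hypothetical guard: return the value unchanged if it fits the column,
// otherwise throw so that truncated data never reaches the database.
function assertFitsColumn(string $value, int $maxLength): string
{
    if (strlen($value) > $maxLength) {
        throw new LengthException(
            'Value of length ' . strlen($value) . " exceeds column limit of $maxLength"
        );
    }
    return $value;
}

$ok = assertFitsColumn('short value', 500);

$rejected = false;
try {
    assertFitsColumn(str_repeat('x', 501), 500);
} catch (LengthException $e) {
    // Log the message and show an error to the user here.
    $rejected = true;
}
```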
Anyways
Whether you choose option 1 or 2 you should always restrict the accepted length of your input fields. Server side and client side.

Which hashing method should I use for large text? - PHP / MYSQL

Most of the text stored in my DB is from 1MB to 1.5MB in size, but never bigger than 1.5MB, because that's the limit I set.
Here are my needs:
I need it for lowering my mysql database size
I need it to be as fast as possible
no security needed
it must just work correctly, so that string_1 and string_2 can never have the same hash
I use PHP and MYSQL.

A hash is not reversible. You can make a 1.5MB text into a small string with the help of hashing, but you cannot convert the same hash back into the original text.
What you are looking for is a compression algorithm. You can make the files a lot smaller with compression, but it's unlikely to be as small as a hash.

I would suggest SHA1, as it is also in use by git and similar applications to identify strings.
See: https://en.wikipedia.org/wiki/Sha1
and: http://php.net/manual/en/function.hash.php
$hash = hash( 'sha1', $inputData );

Saving space
MySQL has built-in COMPRESS() and UNCOMPRESS() functions which will save space in your DB, as well as saving you from having to write extra PHP code.
Checking unique-ness
Instead of indexing TEXT columns [regardless of whether they're compressed or not] you can store and index 2 relatively small things that will, for all practical purposes, guarantee that the text is unique.
A hash of the data, MD5, SHA, whatever you want.
The length of the uncompressed data.
For most hashing functions you're more likely to get hit by a meteor than to have 2 identical hashes for different text strings, and having 2 strings with identical length and hash is less likely than getting hit by a meteor and lightning while winning three simultaneous lotteries.
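As a small sketch of the two values described above: the helper name fingerprint and the array keys are hypothetical, and MD5 is just one possible choice of hash.

```php
<?php
// Hypothetical helper: the two small values to store and index next to the text.
function fingerprint(string $text): array
{
    return [
        'hash'   => md5($text),    // 32 hex characters, whatever the input size
        'length' => strlen($text), // length of the uncompressed data in bytes
    ];
}

$a = fingerprint(str_repeat('lorem ipsum ', 100000)); // 1,200,000 bytes of text
$b = fingerprint('something else');
```

Two rows are then treated as holding the same text only when both the hash and the length match.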

I'm going to assume you want a compression algorithm to reduce the text size.
See http://php.net/manual/en/function.gzcompress.php.
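A short round trip shows the difference from hashing: the compressed value can be turned back into the original. This sketch assumes the zlib extension is enabled; how much space is saved depends entirely on the text, and the repetitive sample here compresses far better than typical content would.

```php
<?php
// Highly repetitive sample text, used only for illustration.
$original = str_repeat('The quick brown fox jumps over the lazy dog. ', 1000);

$compressed = gzcompress($original, 6);  // binary string, compression level 6
$restored = gzuncompress($compressed);   // gives the original text back
```

The compressed value is binary, so it would need a binary-safe column type (for example a BLOB) rather than a text column.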

Is a random string a good verification code

I'm generating a verification code to be used for account activation. You've probably seen this sort of thing before.
My question: if I were to generate this code with a complex formula like this:
md5(md5(time().'helloguys'.rand(0,9999)));
Is it really any better than generating just a random string of 32 characters and numbers like gj3dI3OGwo5Enf...?

No, using the hash is not better. It would be more secure (less predictable) to pick 32 random characters. (Digits are characters.) Use a good ("cryptographic") random number generator, with a good seed (some bytes from /dev/random). Don't use time as a seed.
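A minimal sketch of that advice, assuming PHP 7 or later, where random_bytes() draws from the operating system's cryptographic random source:

```php
<?php
// 16 random bytes from the OS CSPRNG, hex-encoded to 32 characters.
$token = bin2hex(random_bytes(16));
```

Hex encoding uses only the characters 0-9 and a-f, so the token carries 128 bits of randomness in its 32 characters.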

I agree with erickson; I would just advise you to use the
pwgen -1 -s
command on *nix, which will do the job much better than any procedure you may invent.
If you want to generate some string programmatically you may take a look at
<?php
$better_token = md5(uniqid(rand(),1));
?>
This gives a very good level of randomness and resistance to collisions.
If you need an even higher level of security you may consider generating random sequences at http://www.random.org/

Storing a bunch of 3bits long binary data with PHP

My PHP program is working with an array of values ranging from 0 to 7. I'm trying to find the most effective way to store those values in PHP. By most effective I mean using the fewest bits.
It's clear that each value only needs 3 bits of storage space (b000=0 to b111=7). But what is the most efficient way to store those 3-bit values in a binary string?
I don't know in advance how many 3-bit values I'll need to store or restore, but it might be a lot, so 64 bits is clearly not enough.
I was looking into pack() and unpack(): I could store two values in each byte and use a pack('C', $twoValues), but I'm still losing 2 bits.
Will it work? Is there a more effective way of storing those values?
Thanks

You didn't ask if it was a good idea. As many suggested, the benefit of that kind of space compression is easily lost in the extra processing, but that's another topic :)
You also don't mention where you're storing the data afterwards. Whatever that storage location/engine is may have further conditions and specialized types (e.g. a database has a binary column format, might have a byte column format, may even support bit storage, etc.).
But sticking with the topic, I guess the best 3 bit storage is as a nibble (wasting one bit), and I suppose I'd combine two nibbles into a byte (losing two bits overall). Yes, you're losing two bits (if that's key), but it's simple to combine the two values, so your processing overhead is relatively small:
$byte=$val1*8+$val2; // 8 possible values (0 to 7), so multiply by 8
$val2=$byte%8;$val1=($byte-$val2)/8;
If a byte isn't available, you can combine these up to make 16 (4 stored), 32 (8), 64 (16) bit integers. You can also form an array of these values for larger storage.
I'd consider the above more human readable, but you could also use bit-logic to combine and separate the values:
$combinedbyte=$val1<<3|$val2;
$val2=$combinedbyte&7;$val1=($combinedbyte&56)>>3;
(This is effectively what the PACK/UNPACK commands do)
Alternatively you could encode into characters. Since the first few ASCII codes are control characters, you might as well start at a printable one such as 0 (the 64 codes from 0 to o are all printable, and 64 is exactly what you need to store your two values).
$char=chr(($val1*8+$val2)+48); //ord('0')=48
$val2=(ord($char)-48)%8;$val1=(ord($char)-48-$val2)/8;
A series of these encoded characters could be stored as an array or in a null terminated string.
NOTE:
In the case of, say, 64 bit integers above, we're storing 3 bits in 4, so we get 64/4=16 storage locations. This means we're wasting 16 further bits (1 per location), so you might be tempted to add another 5 values, for a total of 21 (21*3=63 bits, only 1 wasted). That's certainly possible (with integer math, although most PHP instances don't work at 64 bits, or with bit-logic solutions), but it complicates things in the long run; it's probably more trouble than it's worth.

The best way is to store them as integers and not get involved with packing things bit by bit. Unless you have an actual engineering reason you need these to be stored as 3-bit values (for example, interfacing with hardware), you're just asking for headaches. Keep in mind that packed values, especially with odd bit sizes, become pretty difficult to access directly. And if you are sticking these values in a database, you wouldn't be able to search or index on values packed like this. Store them as integers, or if in a db, perhaps a short integer or byte.

That kind of technique is only necessary if you will have at least half a billion of these. Think about it, the CPU will have to have data in one register, the mask in another and AND them just to get your value out. Now imagine iterating over a list of these that is long enough to justify that kind of space saving technique. A 50% reduction in space and an order of magnitude slower.

Looking at http://php.net/manual/en/language.types.php, you should store them as integers. However, the question is whether to let one integer value represent many 3-bit values or not. The former is more complex but requires less memory, whereas the latter is the opposite. If you don't have an extreme need to reduce the amount of memory you use, then I would suggest the latter (use one integer for one 3-bit value).
The main problem with storing many 3-bit values in one integer is figuring out how many 3-bit values there are. You could use an array of integers, and then have an extra integer which states the total number of 3-bit values. However, as also stated in the manual, the number of bits used for an integer value is platform-dependent. So you would have to know whether an integer is 32 bits or 64 bits, or else you may try to store too many values and lose data, or you risk using more memory than needed (which would be a bad thing, as you're aiming to use as little memory as possible in the first place).
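The platform dependence can be checked at runtime through the PHP_INT_SIZE constant. The helper below is a hypothetical sketch; it leaves the top (sign) bit unused so the stored integer never becomes negative, which is a design assumption rather than a requirement.

```php
<?php
// How many 3-bit values fit into one integer of $intBytes bytes,
// leaving the sign bit unused.
function threeBitValuesPerInt(int $intBytes = PHP_INT_SIZE): int
{
    $usableBits = $intBytes * 8 - 1;
    return intdiv($usableBits, 3);
}

$onThisPlatform = threeBitValuesPerInt();
$on64Bit = threeBitValuesPerInt(8); // 21
$on32Bit = threeBitValuesPerInt(4); // 10
```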

I would convert each integer to binary, concatenate all of them, and then split the resulting string into bytes. Each byte will be 0-255 so it can be stored as an individual character.
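One way this could look, as a sketch only: the function names are hypothetical, the bit strings make it easy to follow rather than fast, and the number of values has to be kept separately because the last byte is padded with zero bits.

```php
<?php
// Pack 3-bit values (0..7) into a binary string, most significant bit first.
function packThreeBit(array $values): string
{
    $bits = '';
    foreach ($values as $value) {
        $bits .= str_pad(decbin($value), 3, '0', STR_PAD_LEFT);
    }
    if ($bits === '') {
        return '';
    }
    // Pad with zero bits on the right up to a whole number of bytes.
    $bits = str_pad($bits, intdiv(strlen($bits) + 7, 8) * 8, '0');
    $packed = '';
    foreach (str_split($bits, 8) as $byte) {
        $packed .= chr(bindec($byte));
    }
    return $packed;
}

// Restore $count values; the count is needed to ignore the padding bits.
function unpackThreeBit(string $packed, int $count): array
{
    $bits = '';
    for ($i = 0; $i < strlen($packed); $i++) {
        $bits .= str_pad(decbin(ord($packed[$i])), 8, '0', STR_PAD_LEFT);
    }
    $values = [];
    for ($i = 0; $i < $count; $i++) {
        $values[] = bindec(substr($bits, $i * 3, 3));
    }
    return $values;
}

$values = [0, 7, 3, 5, 1, 6, 2, 4, 7];
$packed = packThreeBit($values);          // 27 bits, stored in 4 bytes
$restored = unpackThreeBit($packed, 9);
```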

How to convert numbers to an alpha numeric system with php

I'm not sure what this is called, which is why I'm having trouble searching for it.
What I'm looking to do is to take numbers and convert them to some alphanumeric base so that the number, say 5000, wouldn't read as '5000' but as 'G4u', or something like that. The idea is to save space and also not make it obvious how many records there are in a given system. I'm using php, so if there is something like this built into php even better, but even a name for this method would be helpful at this point.
Again, sorry for not being able to be more clear, I'm just not sure what this is called.

You want to change the base of the number to something other than base 10 (I think you want base 36 as it uses the entire alphabet and numbers 0 - 9).
The built-in base_convert function may help, although it does have the limitation that it can only convert between bases 2 and 36:
$number = '5000';
echo base_convert($number, 10, 36); //3uw
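If mixed-case output like the 'G4u' in the question is wanted, base 36 is not enough, and the conversion has to be written by hand. The following is a sketch for non-negative integers; the function names and the ordering of the alphabet (digits, then lowercase, then uppercase) are arbitrary choices, not a standard.

```php
<?php
// Digits 0-9, then a-z, then A-Z: 62 symbols in total (ordering is an arbitrary choice).
const BASE62_ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Convert a non-negative integer to its base-62 representation.
function toBase62(int $number): string
{
    if ($number === 0) {
        return '0';
    }
    $result = '';
    while ($number > 0) {
        $result = BASE62_ALPHABET[$number % 62] . $result;
        $number = intdiv($number, 62);
    }
    return $result;
}

// Convert a base-62 string back to an integer.
function fromBase62(string $encoded): int
{
    $number = 0;
    for ($i = 0; $i < strlen($encoded); $i++) {
        $digit = strpos(BASE62_ALPHABET, $encoded[$i]);
        if ($digit === false) {
            throw new InvalidArgumentException("Invalid base-62 character: {$encoded[$i]}");
        }
        $number = $number * 62 + $digit;
    }
    return $number;
}

$encoded = toBase62(5000);       // '1iE'
$decoded = fromBase62($encoded); // 5000
```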

Funnily enough, I asked the exact opposite question yesterday.
The first thing that comes to mind is converting your decimal number into hexadecimal. 5000 would turn into 1388, 10000 into 2710. Will save a few bytes here and there.
You could also use a higher base that utilizes the full alphabet (0-Z instead of 0-F) or even the full 256 ASCII characters. As @Yacoby points out, you can use base_convert() for that.
As I said in the comment, keep in mind that this is not an efficient way to mask IDs. If you have a security problem when people can guess the next or previous ID to a record, this is very poor protection.

dechex will convert a number to hex for you. It won't obfuscate how many records are in a given system, however. I don't think it will make it any more efficient to store or save space, either.
You'd probably want to use a 2 way crypt function if obfuscation is needed. That won't save space, either.
Please state your goals more clearly and give more background, because this seems a bit pointless as it is.

This might confuse more people than simply converting the base of the numbers ...
Try using signed digits to represent your numbers. For example, instead of using digits 0..9 for decimal numbers, use digits -5..5. This Wikipedia article gives an example for the binary representation of numbers, but the approach can be used for any numeric base.
Using this together with, say, base-36 arithmetic might satisfy you.

EDIT: This answer is not really a solution to the question, so ignore it unless you are trying to hash a number.
My first thought would be to hash it using e.g. md5 or sha1. (You'd probably not save any space though...)
To prevent people from using rainbow tables or brute force to guess which number you hashed, you can always add a salt. It can be as simple as a string prepended to your number before hashing it.
md5 would return an alphanumeric string of exactly 32 chars and sha1 would return one of exactly 40 chars.
