Storing a bunch of 3-bit binary values with PHP

My PHP program works with an array of values ranging from 0 to 7. I'm trying to find the most efficient way to store those values in PHP. By most efficient I mean using the fewest bits.
It's clear that each value only needs 3 bits of storage (b000=0 to b111=7). But what is the most efficient way to store those 3-bit values in a binary string?
I don't know in advance how many 3-bit values I'll need to store or restore, but it might be a lot, so 64 bits is clearly not enough.
I was looking into pack() and unpack(): I could store two values in each byte and use pack('C', $twoValues), but I'm still losing 2 bits.
Will it work? Is there a more efficient way of storing those values?
Thanks

You didn't ask whether it was a good idea. As many have suggested, the benefit of that kind of space compression is easily lost in the extra processing, but that's another topic :)
You also don't mention where you're storing the data afterwards. Whatever that storage location/engine is may have further conditions and specialized types (e.g. a database has a binary column format, might have a byte column format, may even support bit storage, etc.).
But sticking with the topic, I guess the simplest way to store 3 bits is as a nibble (wasting one bit), and I'd combine two nibbles into a byte (losing two bits overall). Yes, you're losing two bits (if that's key), but it's simple to combine the two values, so your processing overhead is relatively small:
$byte=$val1*8+$val2; // each slot holds 8 possible values (0-7)
$val2=$byte%8;$val1=($byte-$val2)/8;
If a byte isn't available, you can combine these up to make 16 (4 stored), 32 (8), 64 (16) bit integers. You can also form an array of these values for larger storage.
I'd consider the above more human readable, but you could also use bit-logic to combine and separate the values:
$combinedbyte=($val1<<3)|$val2;
$val2=$combinedbyte&7;$val1=($combinedbyte&56)>>3;
(This is effectively what the PACK/UNPACK commands do)
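To go from a single byte to an arbitrary number of values, the same bit logic can be applied pair by pair while building up a binary string. This is only a minimal sketch: packPairs() and unpackPairs() are made-up helper names, and padding an odd-length list with 0 is an assumption, so the original count has to be kept separately.

```php
<?php
// Hypothetical helpers: two 3-bit values (0-7) per byte, first value in the higher bits.
function packPairs(array $values): string
{
    $out = '';
    $count = count($values);
    for ($i = 0; $i < $count; $i += 2) {
        $hi = $values[$i];
        $lo = $values[$i + 1] ?? 0; // assumption: pad an odd-length list with 0
        $out .= chr(($hi << 3) | $lo);
    }
    return $out;
}

function unpackPairs(string $bytes, int $count): array
{
    $values = [];
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $byte = ord($bytes[$i]);
        $values[] = ($byte >> 3) & 7;
        $values[] = $byte & 7;
    }
    return array_slice($values, 0, $count); // drop the padding slot, if any
}

$packed = packPairs([5, 2, 7, 0, 3]);   // 5 values end up in 3 bytes
$restored = unpackPairs($packed, 5);
```

The top two bits of every byte stay unused, which is the two-bit loss mentioned above.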
Alternatively you could encode into characters. Since the first ASCII codes are control characters, you might as well start at a printable one. Two values from 0 to 7 need 64 codes, so starting at '0' (48) keeps everything in the printable range '0' to 'o' (48-111).
$char=chr(($val1*8+$val2)+48); //ord('0')=48
$val2=(ord($char)-48)%8;$val1=(ord($char)-48-$val2)/8;
A series of these encoded characters could be stored as an array or in a null terminated string.
NOTE:
In the case of, say, 64-bit integers above, we're storing 3 bits in 4, so we get 64/4=16 storage locations. This means we're wasting a further 16 bits (1 per location), so you might be tempted to add another 5 values, for a total of 21 (21*3=63 bits, only 1 wasted). That's certainly possible (with integer math, although most PHP instances don't work at 64 bits, or with bit-logic solutions), but it complicates things in the long run and is probably more trouble than it's worth.

The best way is to store them as integers and not get involved with packing things bit by bit. Unless you have an actual engineering reason you need these to be stored as 3-bit values (for example, interfacing with hardware), you're just asking for headaches. Keep in mind that, especially for odd bit sizes, packed values become pretty difficult to access directly. And if you are sticking these values in a database, you wouldn't be able to search or index on values packed like this. Store them as integers, or if in a db, perhaps a short integer or byte.

That kind of technique is only necessary if you will have at least half a billion of these. Think about it, the CPU will have to have data in one register, the mask in another and AND them just to get your value out. Now imagine iterating over a list of these that is long enough to justify that kind of space saving technique. A 50% reduction in space and an order of magnitude slower.

Looking at http://php.net/manual/en/language.types.php, you should store them as integers. However, the question is whether or not to let one integer value represent many 3-bit values. Packing many values into one integer is more complex but requires less memory, whereas one integer per value is the opposite. If you don't have an extreme need to reduce the amount of memory you use, then I would suggest the latter (use one integer for one 3-bit value).
The main problem with storing many 3-bit values in one integer is figuring out how many 3-bit values there are. You could use an array of integers, and then have an extra integer which states the total number of 3-bit values. However, as also stated in the manual, the number of bits used for an integer value is platform-dependent. So you would have to know whether an integer is 32 bits or 64 bits, or else you may try to store too many values and lose data, or you risk using more memory than needed (which would be a bad thing, as you're aiming to use as little memory as possible in the first place).

I would convert each integer to binary, concatenate all of them, and then split the resulting string into bytes. Each byte will be 0-255 so it can be stored as an individual character.
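A rough sketch of that idea, going through a string of '0'/'1' characters for readability rather than speed. pack3() and unpack3() are made-up names, padding the last byte with zero bits is an assumption, and the number of values has to be stored separately because the padding is indistinguishable from real zeros.

```php
<?php
// Hypothetical helpers: store values 0-7 at exactly 3 bits each.
function pack3(array $values): string
{
    $bits = '';
    foreach ($values as $v) {
        $bits .= str_pad(decbin($v), 3, '0', STR_PAD_LEFT);
    }
    if ($bits === '') {
        return '';
    }
    // Assumption: pad the tail with zero bits up to a whole number of bytes.
    $bits = str_pad($bits, (int) ceil(strlen($bits) / 8) * 8, '0');
    $out = '';
    foreach (str_split($bits, 8) as $chunk) {
        $out .= chr(bindec($chunk)); // each byte is 0-255, i.e. one character
    }
    return $out;
}

function unpack3(string $bytes, int $count): array
{
    $bits = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $bits .= str_pad(decbin(ord($bytes[$i])), 8, '0', STR_PAD_LEFT);
    }
    $values = [];
    for ($i = 0; $i < $count; $i++) {
        $values[] = bindec(substr($bits, $i * 3, 3));
    }
    return $values;
}

$packed = pack3([0, 7, 3, 5, 1, 6, 2, 4]); // 8 values * 3 bits = 24 bits = 3 bytes
$restored = unpack3($packed, 8);
```

Eight values fit in three bytes here, compared with four bytes when two values share each byte.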

Related

128 bit reversible encryptor/hasher to reduce DB size

Is there anything out there for PHP that can hash/encrypt a long string into a 128 bit string that can also be reversed?
I am trying to import hundreds of millions of strings into a MySQL DB, and the average string is over 100 characters. MD5 gets this down to 32 characters, which significantly reduces storage; however, I cannot reverse this in my application.
Does PHP have anything available that can handle this?
If I understand your question correctly, it seems to me you mix up hashing and compression quite a lot.
Most hash functions are not easily reversible, because that is not their purpose. There are infinitely many strings/byte streams/numbers that correspond to any given result of a hash function. As you may know, even images that are a few gigabytes in size also give you an md5sum of 32 characters.
You cannot just magically map any string into a shorter string of fixed length and then magically poof it back to its original string.
It may well be that some hash functions could be reversed very efficiently if you know that your target results must have certain properties (in your case maybe a character length of 100-120), but I doubt it.
Or do I totally misunderstand and you just mean ASCII-Strings with the expression "128 bit string"?
No, you can't do this: Pigeonhole principle
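To make the pigeonhole argument concrete, here is a toy illustration that shrinks the digest to a single byte (the first byte of the raw MD5). With only 256 possible digests, any 257 distinct inputs must contain at least one pair that shares a digest, so the mapping cannot be reversed. The input strings are made up; the same counting argument applies to 128-bit digests of 100-character strings, just with far larger numbers.

```php
<?php
// Illustration only: a 1-byte "digest" has 256 possible values,
// so 257 distinct inputs are guaranteed to produce a collision.
$seen = [];
$collision = null;
for ($i = 0; $i < 257; $i++) {
    $input = "string-$i";
    $digest = ord(substr(md5($input, true), 0, 1)); // first raw byte, 0-255
    if (isset($seen[$digest])) {
        $collision = [$seen[$digest], $input]; // two different inputs, same digest
        break;
    }
    $seen[$digest] = $input;
}
```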

PHP rounding my numbers?

I'm doing an API call which is being outputted in JSON,
The product field is "ProductID":3468490060026049912
I convert to PHP, json_decode()
Then I output the "Product ID" = float(3.4684900600261E+18)
It gets changed to a float which is rounded, I input this figure into MYSQL and it stays as the rounded figure.
How do I convert from JSON to PHP without it rounding, I need it correct to 19 digits?
You can use the JSON_BIGINT_AS_STRING flag in the $options parameter of json_decode. You will have a string, though.
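A small sketch of the difference. The ID below is a made-up 20-digit value, chosen because it overflows the integer range even on 64-bit builds; the ID from the question only overflows on 32-bit builds.

```php
<?php
// Made-up ID that is larger than PHP_INT_MAX on both 32-bit and 64-bit builds.
$json = '{"ProductID":12345678901234567890}';

$default  = json_decode($json, true);                            // big number becomes a float
$asString = json_decode($json, true, 512, JSON_BIGINT_AS_STRING); // big number stays exact as a string

var_dump($default['ProductID']);  // float, the low digits are lost
var_dump($asString['ProductID']); // string with all 20 digits
```

The flag only affects numbers that don't fit into a PHP integer; smaller values are still decoded as int.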
You probably don't need to store these IDs as integers; it's not like you're going to do any maths on them.
Store them as strings, and you won't have any issues with precision, no matter how many digits they are.
The only reason you'd need to store these as integers is if you're using them as your primary ID field in the database and its doing auto-increment for new records.
This is also the correct way to handle storage for phone numbers, zip codes, and other data that is formatted as a number but is actually just an arbitrary sequence.
The value is too big to be stored as an integer on 32-bit machines (the maximum is about 2*10^9); that's why PHP converts it to float.
On 64-bit machines it can be stored as an integer, but its value is really close to the limit (which is about 9*10^18). Another ID that starts with 9 or is one digit longer doesn't fit and is converted by PHP to float.
But floating point numbers lose the least significant digits.
As a quick workaround you can pass JSON_BIGINT_AS_STRING as the fourth parameter to json_decode(), but it doesn't fix the source of the problem. You can encounter the same problem in the JavaScript or the database code.
You'd better make "ProductID" a string everywhere. It is an identifier, anyway. You don't need to do math operations with it, just store it and search for it.
Fixed with setting ini_set('precision',19); at the top.

Sequential number obfuscation for puzzle site

I'm working on a site that generates a random puzzle from a number, and the exact puzzle can be recreated using this number. So I give users the URL to the puzzle in case they want to share it with a friend, solve it later, etc. somepuzzlesite.com/4233312409408127365 would generate a unique puzzle that is always the same if they use that link/number.
What I don't want is to expose how the puzzle is generated. The 9th digit, for example, can be 0 to 3, and defines the rotation of the puzzle.
If I just use it "as is" then a user could change a single digit in the url, see what changes, and eventually discover how I make my puzzle. I also wouldn't mind if my number were smaller, since I don't need all the way to 9:
digits 1st to 8th [possible values 0 to 5]
digit 9 [value 0 to 3]
digits 10th to 19th represent the arrangement of 10 objects in order.
I could just specify the first 9 objects in order, and then the unmentioned item is assumed to be last. (that gets me down to 9 digits used)
I could change the base, or use alpha characters in my URL in addition to digits, but some alpha characters are always trouble - lowercase "L" and "1" get mixed up easily, and "o" and zero can too.
But to keep the question simple, I'd just like to make it so that changing a single digit would represent a totally different number, and thereby create a totally different puzzle, rather than the minor difference that would result if I only changed one factor.
Let's see... a rather naive approach would be this: Assign each value so many bits as is necessary to hold it. That is, you'd have eight 3-bit values, one 2-bit value, and ten 4-bit values. That's 8*3+2+10*4=66 bits. Well, if you skip that last one, you'll get 62 bits. You can get it even smaller, but that gets unnecessarily complicated.
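A minimal sketch of that 62-bit layout, assuming a 64-bit PHP build (PHP_INT_SIZE === 8). packPuzzle() and unpackPuzzle() are made-up names, the field order is my own choice, and the tenth object is dropped and reconstructed as described in the question.

```php
<?php
// Hypothetical helpers; assumes 64-bit integers.
function packPuzzle(array $cells, int $rotation, array $order): int
{
    $n = 0;
    foreach ($cells as $c) {                      // eight values 0-5, 3 bits each
        $n = ($n << 3) | $c;
    }
    $n = ($n << 2) | $rotation;                   // one value 0-3, 2 bits
    foreach (array_slice($order, 0, 9) as $o) {   // first nine objects, 4 bits each
        $n = ($n << 4) | $o;
    }
    return $n;                                    // 24 + 2 + 36 = 62 bits used
}

function unpackPuzzle(int $n): array
{
    $order = [];
    for ($i = 0; $i < 9; $i++) {
        array_unshift($order, $n & 15);
        $n >>= 4;
    }
    // The object that was not listed is assumed to be last.
    $order[] = array_values(array_diff(range(0, 9), $order))[0];
    $rotation = $n & 3;
    $n >>= 2;
    $cells = [];
    for ($i = 0; $i < 8; $i++) {
        array_unshift($cells, $n & 7);
        $n >>= 3;
    }
    return [$cells, $rotation, $order];
}

// Fields taken from the example number 4233312409408127365.
$cells  = [4, 2, 3, 3, 3, 1, 2, 4];
$order  = [9, 4, 0, 8, 1, 2, 7, 3, 6, 5];
$packed = packPuzzle($cells, 0, $order);
```

The resulting integer is what would then be fed to the encryption step.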
Anyway.
Just take any standard encryption algorithm and apply it to these 62 bits. The industry-standard AES (aka Rijndael) operates on 128-bit blocks, which might be a bit too lengthy - or maybe not, depending on your preferences. 3DES won't be any worse for your purposes, and works on 64-bit blocks, which is just perfect.
When you've got your encrypted 64 or 128 bits, just hex-encode them and make that the URL. If it's 64 bits, you'll have 16 hex characters. Not too much. And you'd be hard pressed to go lower anyway. Plus, it uses only 0-9, A-F, and there is little chance of mix-ups when calling over the phone. Not that people often share links vocally these days. :P
Your number is about 18 digits or about 61-62 bits in size. That means that it will fit nicely in a single DES block (8 bytes, or 64 bits). If you encrypt it in ECB mode you would retrieve a 64 bit value, which looks like a random value. You can leave the key on the server. A single 8 byte DES key should be enough for obfuscation, but you could also use 16/24 byte key for DESede encryption.
So: when generating a new random puzzle: create your number, convert it into a byte array with a length of 8 bytes (or N * 8 bytes if your number gets too big) then encrypt it with a single key kept on the server (8, 16 or 24 randomly generated bytes) and on some backup. The result will be 8 bytes again, which you can convert to a number of about 20 digits. If the user supplies a previously generated number, you can decrypt it with the key on the server, revert the resulting bytes back into the number used to create the puzzle.
Note that if the user just enters some random number, it will still decrypt, so you may want to check the resulting number for validity (e.g. test if a digit is indeed 0..3 and not something else).
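Such a validity check could look roughly like this, assuming the decrypted value has already been turned back into the 19-digit decimal string from the question. isValidPuzzleNumber() is a made-up name.

```php
<?php
// Hypothetical validator for the digit layout described in the question.
function isValidPuzzleNumber(string $digits): bool
{
    // Digits 1-8 are 0-5, digit 9 is 0-3, followed by ten more digits.
    if (!preg_match('/^[0-5]{8}[0-3][0-9]{10}$/', $digits)) {
        return false;
    }
    // The last ten digits must be an arrangement of the ten objects 0-9.
    $order = str_split(substr($digits, 9));
    sort($order);
    return $order === str_split('0123456789');
}

$ok = isValidPuzzleNumber('4233312409408127365');
```

A number that decrypts to anything else can simply be answered with a "puzzle not found" page.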
Another approach to solve this would be to save the puzzles internally and bind the puzzle to an unique ID.

Storing something as unsigned tinyint in php

Is there a way to explicitly store my numbers in PHP as tinyint (1 byte instead of 4)?
Or could I only enforce this by storing them 4 by 4 in an int (using a few binary operations)?
I generate these values by breaking a string using str_split and interpreting these bytes as ints via unpack( 'C' , .. ).
Currently I store these values in an array as individual integers, but it could save a lot of space if I could store them somehow as tinyints.
PHP has two data types that you may want to use here: integer and string.
PHP doesn't have any other types you could choose from (float wouldn't be a good choice for integers, the other types are not appropriate).
An int is usually 32 or 64 bits, a string is 1 byte per character.* I propose that unless you have a lot of numbers, you won't ever see any problem with 32 bit ints. If you absolutely positively want to save memory** and your numbers have a maximum of 3 digits, you could handle your numbers as strings. There's even the BCMath extension that'll let you operate on string numbers directly without needing to cast them back and forth. It's quite a lot of hassle for possibly very limited gain though.
Seeing that a MySQL TINYINT is usually used for boolean values though, please be aware PHP does have a boolean type...!
* One byte per one-byte character, that is.
** Since PHP scripts are usually only very temporary, you should only have problems with peak memory usage, not storage space. Getting more RAM may be the more efficient solution than playing with types.
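Since the values in the question already come from the bytes of a string, a related option (not the decimal-string idea above, just a sketch that builds on "a string is 1 byte per character") is to leave them in the string and convert single bytes on demand. The example bytes are made up.

```php
<?php
// Keep the bytes in the string and read each one as an integer when needed,
// instead of expanding them into an array of PHP integers.
$data = "\x00\x7F\xFF\x2A";   // example bytes (made up)

$third = ord($data[2]);       // read one value: 255
$data[3] = chr(200);          // overwrite one value (0-255) in place

$sum = 0;
for ($i = 0, $n = strlen($data); $i < $n; $i++) {
    $sum += ord($data[$i]);   // iterate without building an array
}
```

The trade-off is an ord()/chr() call on every access.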

How to convert numbers to an alpha numeric system with php

I'm not sure what this is called, which is why I'm having trouble searching for it.
What I'm looking to do is to take numbers and convert them to some alphanumeric base so that the number, say 5000, wouldn't read as '5000' but as 'G4u', or something like that. The idea is to save space and also not make it obvious how many records there are in a given system. I'm using php, so if there is something like this built into php even better, but even a name for this method would be helpful at this point.
Again, sorry for not being able to be more clear, I'm just not sure what this is called.
You want to change the base of the number to something other than base 10 (I think you want base 36 as it uses the entire alphabet and numbers 0 - 9).
The inbuilt base_convert function may help, although it does have the limitation it can only convert between bases 2 and 36
$number = '5000';
echo base_convert($number, 10, 36); //3uw
Funnily enough, I asked the exact opposite question yesterday.
The first thing that comes to mind is converting your decimal number into hexadecimal. 5000 would turn into 1388, 10000 into 2710. Will save a few bytes here and there.
You could also use a higher base that utilizes the full alphabet (0-Z instead of 0-F) or even all 256 byte values. As @Yacoby points out, you can use base_convert() for that.
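base_convert() stops at base 36, so going higher needs a few lines of your own. A minimal sketch with a caller-supplied alphabet; encodeBase() and decodeBase() are made-up names, and the base-62 alphabet is just one possible choice (leaving out easily confused characters such as l/1 or o/0 only means passing a shorter alphabet).

```php
<?php
// Hypothetical helpers: convert a non-negative integer to and from a custom alphabet.
function encodeBase(int $n, string $alphabet): string
{
    $base = strlen($alphabet);
    if ($n === 0) {
        return $alphabet[0];
    }
    $out = '';
    while ($n > 0) {
        $out = $alphabet[$n % $base] . $out; // least significant digit first
        $n = intdiv($n, $base);
    }
    return $out;
}

function decodeBase(string $s, string $alphabet): int
{
    $base = strlen($alphabet);
    $n = 0;
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        // Assumes every character of $s occurs in $alphabet.
        $n = $n * $base + strpos($alphabet, $s[$i]);
    }
    return $n;
}

$base62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
$code = encodeBase(5000, $base62);   // "1Ie"
$back = decodeBase($code, $base62);  // 5000
```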
As I said in the comment, keep in mind that this is not an efficient way to mask IDs. If you have a security problem when people can guess the next or previous ID to a record, this is very poor protection.
dechex will convert a number to hex for you. It won't obfuscate how many records are in a given system, however. I don't think it will make it any more efficient to store or save space, either.
You'd probably want to use a 2 way crypt function if obfuscation is needed. That won't save space, either.
Please state your goals more clearly and give more background, because this seems a bit pointless as it is.
This might confuse more people than simply converting the base of the numbers ...
Try using signed digits to represent your numbers. For example, instead of using digits 0..9 for decimal numbers, use digits -5..5. This Wikipedia article gives an example for the binary representation of numbers, but the approach can be used for any numeric base.
Using this together with, say, base-36 arithmetic might satisfy you.
EDIT: This answer is not really a solution to the question, so ignore it unless you are trying to hash a number.
My first thought would be to hash it using e.g. md5 or sha1. (You'd probably not save any space though...)
To prevent people from using rainbow tables or brute force to guess which number you hashed, you can always add a salt. It can be as simple as a string prepended to your number before hashing it.
md5 would return an alphanumeric string of exactly 32 chars and sha1 would return one of exactly 40 chars.
