Compress a binary matrix - PHP

I have to send a large set of data containing biological samples, which are two-dimensional square arrays of 1s and 0s.
Say for example:
[
[0,1,0],
[1,0,0],
[0,0,0]
]
So this example is 3×3. Mine is 60-70 rows/columns and expected to go up to 120 (max). I have to send this via Ajax/API and also store it in a database.
I could serialise it as JSON, but I was wondering if there is a more optimal way to handle things like this, say with proper compression/decompression?
One way I could think of is:
Join the digits into a single string.
Divide it into clusters of 6 digits; 111111 bin = 63 dec. (A-Z, a-z, 0-9 and _ give 26 + 26 + 10 + 1 = 63 symbols; strictly, 6 bits span 64 values, 0-63, so one more symbol such as - is needed, which makes this exactly the base64url alphabet.)
Convert each cluster to that alphanumeric encoding and join the results into a string.
How smart/stupid/optimal is this solution? Is something better already out there?

Converting your data structure to JSON, then passing it through a standard compression algorithm like gzdeflate() is about as simple as you can get, and will result in excellent compression ratios. There's probably no reason to make it any more complicated than that.
(The output of gzdeflate will be binary data. If you need to transfer it over a channel which can't deal with that, you can base64_encode it; the results will still be smaller than the original JSON for a matrix of any meaningful size.)
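For illustration, a minimal sketch of that round trip (nothing here beyond standard PHP functions; the 3×3 matrix is just the example from the question):

&lt;?php
$matrix = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
];

// Compress: JSON -> DEFLATE -> (optionally) base64 for text-safe channels.
$encoded = base64_encode(gzdeflate(json_encode($matrix), 9));

// Decompress: reverse the steps.
$decoded = json_decode(gzinflate(base64_decode($encoded)), true);

var_dump($decoded === $matrix); // bool(true)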
"Flattening" the matrix into a single string of 1s and 0s (and storing the dimensions of the original matrix alongside the string) before compressing it may give you slightly better compression ratios, but at the expense of making your code more complicated.
Performing alphanumeric encoding on the matrix like you're describing in your question will result in significantly worse compression ratios, as it will make it very difficult for the DEFLATE algorithm to detect any patterns in your data which don't "line up" perfectly with the size of your clusters.

Related

Storing 6 billion floats for easy access in files

I need to save 250 data files an hour with 36000 small arrays of [date, float, float, float] in Python, which I can read somewhat easily with PHP. This needs to run for 10 years minimum, on 6 TB of storage.
What is the best way to save these individual files? I am thinking Python's struct, but it starts to look like a bad fit for large data amounts.
example of data
a = [["2016:04:03 20:30:00", 3.423, 2.123, -23.243], ["2016:23:.....], ......]
Edit:
Space is more important than unpacking speed and computation, since space is very limited.
So you have 250 data providers of some kind, which are providing 10 samples per second of (float, float, float).
Since you didn't specify what your limitations are, there are several options to consider.
Binary files
You could write files containing a fixed array of 3 * 36000 floats using struct; at 4 bytes each, that gets you 432,000 bytes per file. You can encode the hour in the directory name and the data provider's ID in the file name.
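Since the files must be readable from PHP, here is a hedged sketch of reading one such fixed-record file (the path and the little-endian 32-bit float layout are assumptions for illustration, and the file is assumed well-formed):

&lt;?php
$raw = file_get_contents('2016/04/03/20/provider_042.bin'); // hypothetical path
$samples = [];
foreach (str_split($raw, 12) as $record) {                  // 3 floats * 4 bytes = 12 bytes
    $samples[] = array_values(unpack('g3', $record));       // 'g' = little-endian float
}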
If your data isn't too random, a decent compression algorithm should shave off enough bytes, but you would probably need some sort of delayed compression if you don't want to lose data.
numpy
An alternative to packing with struct is numpy.tofile, which stores the array directly to a file. It is fast, but it always stores data in C format, so you should take care if the endianness of the target machine is different. With numpy.savez_compressed you can store a number of arrays in one npz archive and compress it at the same time.
JSON, XML, CSV
A good option is any of the mentioned formats. Also worth mentioning is the JSON Lines format, where each line is a JSON-encoded record. This enables streaming writes, keeping the file in a valid format after each write.
They are simple to read, and the syntactic overhead goes away with compression. Just don't do string concatenation; use a real serializer library.
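A minimal JSON Lines append in PHP might look like this (the file name and the record are just illustrations):

&lt;?php
$fh = fopen('samples.jsonl', 'a'); // hypothetical file name
// One JSON record per line keeps the file parseable after every append.
fwrite($fh, json_encode(['2016:04:03 20:30:00', 3.423, 2.123, -23.243]) . "\n");
fclose($fh);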
(SQL) Database
Seriously, why not use a real database?
Obviously you will need to do something with the data. At 10 samples per second, no human will ever look at all of it, so you will have to do aggregations: minimum, maximum, average, sum, etc. Databases already have all this, and in combination with other features they can save you a ton of time that you would otherwise spend writing oh so many scripts and abstractions over files. Not to mention how cumbersome the file management becomes.
Databases are extensible and supported by many languages. You save a datetime in the database with Python, and you read the same datetime with PHP. No hassles over how to encode your data.
Databases support indexes for faster lookup.
My personal favourite is PostgreSQL, which has a number of nice features. It supports BRIN, a lightweight index that's perfect for huge datasets with naturally ordered fields such as timestamps. If you're low on disk, you can extend it with cstore_fdw, a column-oriented datastore that supports compression. And if you still want to use flat files, you can write a foreign data wrapper (also possible with Python) and still use SQL to access the data.
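As a hedged sketch of the BRIN suggestion from PHP (the table, columns, and connection details are hypothetical):

&lt;?php
$pdo = new PDO('pgsql:host=localhost;dbname=samples', 'user', 'pass');
$pdo->exec('CREATE TABLE IF NOT EXISTS readings (
    provider_id integer,
    ts timestamptz,
    a real, b real, c real
)');
// A BRIN index is tiny and suits naturally ordered columns like timestamps
// (PostgreSQL 9.5+).
$pdo->exec('CREATE INDEX IF NOT EXISTS readings_ts_brin ON readings USING brin (ts)');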
Unless you're consuming the files in the same language, avoid language-specific formats and structures. Always.
If you're going between 2 or more languages, use a common, plain text data format like JSON or XML that can be easily (often natively) parsed by most languages and tools.
If you follow this advice and you're storing plain text, then use compression on the stored file; that's how you conserve space. Typical well-structured JSON tends to compress really well (assuming simple text content).
Once again, choose a compression format like gzip that's widely supported by languages and their core libraries. PHP, for example, has the native function gzopen(), and Python has the gzip module in its standard library.
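For example, in PHP (the file name and the sample record are just illustrations):

&lt;?php
// Write gzip-compressed JSON ('w9' = write mode, maximum compression).
$gz = gzopen('samples.json.gz', 'w9');
gzwrite($gz, json_encode([['2016:04:03 20:30:00', 3.423, 2.123, -23.243]]));
gzclose($gz);

// Read it back.
$gz = gzopen('samples.json.gz', 'r');
$json = '';
while (!gzeof($gz)) {
    $json .= gzread($gz, 8192);
}
gzclose($gz);
$data = json_decode($json, true);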
I doubt it is possible without extremely efficient compression.
6 TB / 10 years / 365 days / 24 hours / 250 files ≈ 270 KB per file.
That's the ideal case; in the real world, the filesystem's cluster size matters.
If you have 36,000 "small arrays" to fit into each file, you have only about 7 bytes per array, which is not enough to store even a proper datetime object alone.
One idea that comes to mind if you want to save space: store only the values and discard the timestamps. Produce files with only data, and make sure that you create a kind of index (a formula) that, given a timestamp (year/month/day/hour/min/sec...), yields the position of the data inside the file (and, of course, which file to look in). In fact, if you use a "smart" naming scheme for the files, you can avoid storing the year/month/day/hour at all, since that part of the index can be the file name. It all depends on how you implement your "index" system, but pushed to the extreme you could forget about timestamps entirely and focus only on the data.
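A hypothetical sketch of such an index formula in PHP (10 samples per second and 12 bytes per sample are assumptions; the directory layout and function name are made up for illustration):

&lt;?php
function locate(DateTimeImmutable $ts, int $providerId): array {
    // The year/month/day/hour live in the path, so they need not be stored.
    $file = sprintf('%s/provider_%03d.bin', $ts->format('Y/m/d/H'), $providerId);
    // Offset within the file: seconds into the hour * samples/sec * bytes/sample.
    $secondOfHour = (int)$ts->format('i') * 60 + (int)$ts->format('s');
    $offset = $secondOfHour * 10 * 12;
    return [$file, $offset];
}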
Regarding the data format, as mentioned above, I would definitely go with a language-independent format such as XML or JSON... Who knows which languages and possibilities you will have in ten years ;)

Best way to encode data

I have a huge amount of data in my database with the format:
lat;lon;speed;sec:lat;lon;speed;sec......
for example:
53.284534;50.227268;67;0:53.285481;50.226627;68;6:53.286429;50.226042;66;12:.......
The format is latitude, longitude, speed, and the number of seconds from the beginning.
The length of each string is from 1,000 to 100,000 characters.
I try to compress it before putting it in the database via gzcompress() and base64_encode().
For an initial string of 7,607 symbols, after gzcompress and base64_encode it comes out at 3,444,
so compression is about 50%.
Is there a more effective way to compress strings like this?
There is clearly a strong correlation from sample to sample. I would subtract from each sample the previous sample, except of course for the first one. I would encode each difference as a variable-length integer (not as text but in binary). For lat and lon I would multiply by 1,000,000, on the assumption (which you need to verify) that there are never more than six digits after the decimal point. The second and third samples would then each require only six bytes.
Then I would compress with gzip.
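A hedged sketch of that idea in PHP (using fixed 32-bit deltas rather than true variable-length integers, which would shrink the small deltas even further):

&lt;?php
$samples = [
    [53.284534, 50.227268, 67, 0],
    [53.285481, 50.226627, 68, 6],
    [53.286429, 50.226042, 66, 12],
];

$binary = '';
$prev = [0, 0, 0, 0];
foreach ($samples as $s) {
    // Scale lat/lon to integers (six decimal places assumed, as above).
    $cur = [(int)round($s[0] * 1000000), (int)round($s[1] * 1000000), $s[2], $s[3]];
    // 'l4' = four signed 32-bit integers holding the deltas.
    $binary .= pack('l4', $cur[0] - $prev[0], $cur[1] - $prev[1],
                          $cur[2] - $prev[2], $cur[3] - $prev[3]);
    $prev = $cur;
}
$compressed = gzcompress($binary, 9);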
Try just storing them as binary floats. This is very simple and it's very fast.
Each number would use 4 bytes and that would make it possible to use them directly from within your code.
Or if you need more precision, multiply each component by a pre-defined value (which may differ per component) and store them as 32-bit integer words.
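A two-line sketch of the raw-float idea (pack/unpack with the machine-order 'f' format):

&lt;?php
$record = pack('f4', 53.284534, 50.227268, 67.0, 0.0);        // 4 bytes per value
[$lat, $lon, $speed, $sec] = array_values(unpack('f4', $record));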

Are AJAX Posts 8 bit Clean? / Relation to Base64 / An alternative? / Where is it?

Base64 only uses 6 bits per character (2^6 = 64) to create textual data from image files. This causes an inefficiency.
According to the Wikipedia entry on Base64, this inefficiency is there to protect against 8-bit-dirty channels like email.
Is Ajax POSTing 8-bit clean? If so, is there an alternative to using Base64?
php.net (as does Wikipedia) claims a 33% inefficiency for base64_encode.
Kind of. All JavaScript strings are UTF-16, not byte strings. If you're sending the data with send(), it will be encoded into UTF-8 before it is sent. As such, you can convert your bytes into Unicode code points, which will then be encoded into UTF-8. When the data reaches the server, you'll have to decode the UTF-8 and then convert the code points back into bytes.
For 7-bit data, this will not expand the size of the data at all. For 8-bit data with the most significant bit always set, it will double the size of the data. For 8-bit data with the most significant bit set half of the time, it will increase the size of your data by 50%, which is worse than Base64's 33.3% increase.
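A small PHP sketch of that size argument (PHP stands in for the server side here; mb_chr treats each byte value as a code point):

&lt;?php
$bytes = random_bytes(1000);
$utf8 = '';
foreach (str_split($bytes) as $b) {
    $utf8 .= mb_chr(ord($b), 'UTF-8'); // code points U+0000..U+00FF
}
// Random bytes set the high bit about half the time, so expect roughly
// 1500 bytes of UTF-8 versus ~1336 bytes of Base64.
printf("raw: %d, as UTF-8 code points: %d, base64: %d\n",
    strlen($bytes), strlen($utf8), strlen(base64_encode($bytes)));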
On the other hand, using XMLHttpRequest Level 2 allows you to send binary data by passing send() an ArrayBuffer, Blob, or FormData. However, XMLHttpRequest Level 2 is only supported in newer browsers.
I think AJAX posting is the same as a generic POST request in that respect; that's why we need multipart/form-data for sending file contents, for example. Usually the data gets URL-encoded, but Base64 is perhaps a better way, as it's (generally) more efficient.
UPDATE: It might be helpful to look at this the other way around. You need some stream of values that might take all 8 bits to safely pass through 7-bit filtering. The perfect solution would be a '7-to-8' encoding, where each 7 bytes become 8 'safe' characters. But this is not applicable, as some of those 7-bit characters are actually used to carry additional (meta) information about the stream...
Now you have a dilemma: either use the next integer divisor (6 bits, which is Base64), or try to invent a scheme with a 'non-integer' divider. Such schemes exist (check Ascii85, for example), but they are rarely used.

Storing something as unsigned tinyint in php

Is there a way to explicitly store my numbers in PHP as a tinyint (1 byte instead of 4)?
Or could I only enforce this by storing them 4 at a time in an int (using a few binary operations)?
I generate these values by breaking a string apart with str_split and interpreting the bytes as ints via unpack('C', ..).
Currently I store these values in an array as individual integers, but it could save a lot of space if I could store them somehow as tinyints.
PHP has two data types that you may want to use here: integer and string.
PHP doesn't have any other types you could choose from (float wouldn't be a good choice for integers, the other types are not appropriate).
An int is usually 32 or 64 bits; a string is 1 byte per character.* I propose that unless you have a lot of numbers, you won't ever see any problem with 32-bit ints. If you absolutely, positively want to save memory** and your numbers have a maximum of 3 digits, you could handle your numbers as strings. There's even the BCMath extension that lets you operate on string numbers directly without casting them back and forth. It's quite a lot of hassle for possibly very limited gain, though.
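For instance, with the BCMath extension loaded:

&lt;?php
$a = '200';
$b = '55';
echo bcadd($a, $b); // "255": arithmetic on numeric strings, no native ints involved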
Seeing that a MySQL TINYINT is usually used for boolean values though, please be aware PHP does have a boolean type...!
* One byte per one-byte character, that is.
** Since PHP scripts are usually only very temporary, you should only have problems with peak memory usage, not storage space. Getting more RAM may be the more efficient solution than playing with types.

Storing a bunch of 3bits long binary data with PHP

My PHP program is working with an array of values ranging from 0 to 7. I'm trying to find the most effective way to store those values in PHP. By most effective I mean using the fewest bits.
It's clear that each value only needs 3 bits of storage space (b000 = 0 to b111 = 7). But what is the most efficient way to store those 3-bit values in a binary string?
I don't know in advance how many 3-bit values I'll need to store or restore, but it might be a lot, so 64 bits is clearly not enough.
I was looking into pack() and unpack(): I could store two values in each byte with pack('C', $twoValues), but I'm still losing 2 bits.
Will it work? Is there a more effective way of storing those values?
Thanks
You didn't ask if it was a good idea; as many have suggested, the benefit of that kind of space compression is easily lost in the extra processing. But that's another topic :)
You're also not mentioning where you're storing the data afterwards. Whatever that storage location/engine is, it may have further constraints and specialized types (e.g. a database has a binary column format, might have a byte column format, and may even support bit storage, etc.).
But sticking with the topic: I guess the best 3-bit storage is a nibble (wasting one bit), and I suppose I'd combine two nibbles into a byte (losing two bits overall). Yes, you're losing two bits (if that's key), but it's simple to combine the two values, so your processing overhead is relatively small:
$byte = $val1 * 8 + $val2;                       // values are 0-7, so combine in base 8
$val2 = $byte % 8; $val1 = ($byte - $val2) / 8;
If a byte isn't available, you can combine these up to make 16 (4 stored), 32 (8), 64 (16) bit integers. You can also form an array of these values for larger storage.
I'd consider the above more human-readable, but you could also use bit logic to combine and separate the values:
$combinedbyte = $val1 << 3 | $val2;
$val2 = $combinedbyte & 7; $val1 = ($combinedbyte & 56) >> 3;
(This is effectively what the pack/unpack functions do.)
Alternatively you could encode the pairs into printable characters. Two 0-7 values give 8 * 8 = 64 combinations, so you need a run of 64 usable character codes; since the first few ASCII characters are control codes, you can start at '0' (ASCII 48), which keeps the whole range 48-111 printable:
$char = chr(($val1 * 8 + $val2) + 48);  // ord('0') = 48
$val2 = (ord($char) - 48) % 8; $val1 = (ord($char) - 48 - $val2) / 8;
A series of these encoded characters can be stored as an array or in a plain string.
NOTE:
In the case of, say, the 64-bit integers above, we're storing 3 bits in 4, so we get 64/4 = 16 storage locations. This means we're wasting 16 further bits (1 per location), so you might be tempted to add another 5 values, for a total of 21 (21 * 3 = 63 bits, only 1 wasted). That's certainly possible (with integer math, although many PHP builds don't run at 64 bits, or with bit-logic solutions), but it complicates things in the long run; probably more trouble than it's worth.
The best way is to store them as integers and not get involved with packing things bit by bit. Unless you have an actual engineering reason these need to be stored as 3-bit values (for example, interfacing with hardware), you're just asking for headaches. Keep in mind, especially for odd bit sizes, that they become pretty difficult to access directly if you do this. And if you are sticking these values in a database, you wouldn't be able to search or index on values packed like this. Store them as integers, or, if in a DB, perhaps as a short integer or byte.
That kind of technique is only necessary if you will have at least half a billion of these. Think about it: the CPU has to load the data into one register and the mask into another, and AND them just to get your value out. Now imagine iterating over a list of these long enough to justify that kind of space-saving technique. A 50% reduction in space, and an order of magnitude slower.
Looking at http://php.net/manual/en/language.types.php, you should store them as integers. However, the question is whether to let one integer value represent many 3-bit values or not. The former is more complex but requires less memory, whereas the latter is the opposite. If you don't have an extreme need to reduce memory use, I would suggest the latter (one integer per 3-bit value).
The main problem with storing many 3-bit values in one integer is figuring out how many 3-bit values there are. You could use an array of integers, plus an extra integer stating the total number of 3-bit values. However, as also stated in the manual, the number of bits in an integer is platform-dependent, so you would have to know whether an integer is 32 or 64 bits; otherwise you may try to store too many values and lose data, or risk using more memory than needed (which would defeat the purpose, as you're trying to use as little memory as possible in the first place).
I would convert each integer to binary, concatenate all of them, and then split the resulting string into bytes. Each byte will be 0-255 so it can be stored as an individual character.
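A hedged sketch of that last suggestion (the function names are made up; per the point above about counts, the caller must remember how many values were stored):

&lt;?php
function pack3(array $values): string {
    $bits = '';
    foreach ($values as $v) {
        $bits .= str_pad(decbin($v & 7), 3, '0', STR_PAD_LEFT); // 3 bits per value
    }
    $out = '';
    foreach (str_split($bits, 8) as $chunk) {
        $out .= chr(bindec(str_pad($chunk, 8, '0')));           // right-pad the last byte
    }
    return $out;
}

function unpack3(string $packed, int $count): array {
    $bits = '';
    foreach (str_split($packed) as $byte) {
        $bits .= str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    $values = [];
    for ($i = 0; $i < $count; $i++) {
        $values[] = bindec(substr($bits, $i * 3, 3));
    }
    return $values;
}

$vals = [0, 7, 3, 5, 1, 6];
var_dump(unpack3(pack3($vals), count($vals)) === $vals); // bool(true)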
