Compacting string data for storage and retrieval

Compacting string data for storage and retrieval - php

I have some text data I would like to store in a mysql database. I currently have the data stored in a variable as a string.
I'm concerned that the table will become quite large due to the amount of text data I have for each row.
Therefore, what is the most easiest way (preferably php built in functions) of compacting this string data in a format ideal for storage and retrieval?

You could GZIP the string with GZEncode.
That's pretty standard and thus should be reversible from other languages if you want to.
I would advise storing a Base64 version of the result.

If you're using InnoDB you can enable compression on entire tables which doesn't impact your code at all.
ALTER TABLE database.tableName ENGINE='InnoDB' ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
You can alter the KEY_BLOCK_SIZE to smaller values to get more compression (depending on the data), but this adds more overhead to the CPU.
After testing a range of tables, I found a KEY_BLOCK_SIZE of 8 to be a good balance of compression vs performance.

Related

How to cache BLOB type in Laravel

Hi everyone,
Is there a way to cache BLOB types temporarily in Laravel ?
Scenario:
I'm gonna cache some data MEDIUMBLOB with the size of 2048KB temporarily.
These data are actually parts of a large single file 16MB.
After caching all parts, they will be combined together into a single file, then will be removed from cache.
The content of each single part is given by file_get_contents function.
I'm already doing this with MySQL. (However, there are lots of queries and takes time to be done.)
Is there a better way to store MEDIUMBLOB data temporarily in Cache storage ?
I've faced with Redis and Cache in Laravel, but I'm not sure they support MEDIUMBLOB.

Here's the main point. In the end a BLOB is just a string of bytes. The major difference with a formal string type is that a BLOB doesn't have any sort of encoding or collation associated with it. It's a kind of binary string which is another way of saying "generic data". file_get_contents also generally returns this sort of "generic data".
Laravel's cache is build to be generic like that so it doesn't have any specific data type associated with it. It's just a store of key/value pairs and the keys must be ascii strings while the values can be anything Laravel serialises things before they go in the cache so basically anything that can be represented as a variable in PHP and is serializable can be cached.

Storing 6 billion floats for easy access in files

I need to save 250 data files an hour with 36000 small arrays of [date, float, float, float] in python, that I can read somewhat easily with PHP. This needs to run for 10 years minimum, on 6tb of storage.
What is the best way to save these individual files, I am thinking python struct. But it starts to look bad for the job with large data amounts?
example of data
a = [["2016:04:03 20:30:00", 3.423, 2.123, -23.243], ["2016:23:.....], ......]
Edit:
Space, is more important than unpacking speed and computation. Since the space is very limiting.

So you have 250 data providers of some kind, which are providing 10 samples per second of (float, float, float).
Since you didn't specify what your limitations are, there are more options.
Binary files
You could write files of fixed array of 3*36000 floats with struct, at 4 bytes each gets you at 432.000 bytes per file. You can encode the hour in the directory name and id of the data provider in file name.
If your data isn't too random, a decent compression algorithm should shave enough bytes, but you would probably need to have some sort of delayed compression if you wouldn't want to lose data.
numpy
An alternative to packing with struct is numpy.tofile, which stores the array directly to file. It is fast, but always stores data in C format, where you should take care if the endian is on target machine is different. With numpy.savez_compressed you can store a number of arrays in one npz archive, and also compress it at same time.
JSON, XML, CSV
A good option is any of the mentioned formats. Also worth mentioning is JSON-lines format, where each line is a JSON encoded record. This is to enable streaming writing, where you keep a valid file format after each write.
They are simple to read, and the syntactic overhead goes away with compression. Just don't do string concatenation, use a real serializer library.
(SQL) Database
Seriously, why not use a real database?
Obviously you will need to do something with the data. With 10 samples per second, no human will need so much data, so you will have to do aggregations: minimum, maximum, average, mean, sum, etc. Databases already have all this and with combination of other features they can save you a ton of time you can otherwise spend on writing oh so many scripts and abstractions over files. Not to mention just how cumbersome the file management becomes.
Databases are extensible and supported by many languages. You save a datetime in database with Python, you read datetime with PHP. No hassles with how you are going to have to encode your data.
Databases support indexes for faster lookup.
My personal favourite is PostgreSQL, which has a number of nice features. It supports BRIN index, a lightweight index, perfect for huge datasets with naturally ordered fields, such as timestamps. If you're low on disk, you can extend it with cstore_fdw, a columnar oriented datastore, which supports compression. And if you still want to use flat files, you can write a foreign data wrapper (also possible with Python) and still use SQL to access the data.

Unless you're consuming the files in the same language, avoid language specific formats and structures. Always.
If you're going between 2 or more languages, use a common, plain text data format like JSON or XML that can be easily (often natively) parsed by most languages and tools.
If you follow this advice and your storing plain text, then use compression on the stored file--that's how you conserve space. Typical well-structured JSON tends to compresses really well (assuming simple text content).
Once again, choose a compression format like gzip that's widely supported in by languages or their core libraries. PHP for example has a native function gzopen() and python has lib\gzip.py in the standard python library.

I doubt it is possible without extremely efficient compression.
6TB / 10 year/ 365 days / 24 hrs / 250 files = 270 KB per file.
In ideal case. In real word size of cluster matters.
If you have 36,000 “small arrays” to fit into each file, you have only 7 bytes per array, which is not enough to store even proper datetime object alone.

One idea that comes to my mind if you want to save space. You better store only values and discard timestamps. Produce files with only data and make sure that you created a kind of index (formula) that given a timestamp (year/month/day/hour/min/sec...) results in the position of the data inside of the file (and of course the file that you have to go for). Even, if you check twice you will discover that if you use an "smart" naming scheme for the files you can avoid to store information about year/month/day/hour, since part of the index could be the file name. That all depends on how do you implement your "index" system, but pushing to an extreme version you could forget about timestamps and focus only in data.
Regarding data format, as aforementioned, I would definitively go on language independent format as XML, JSON... Who know which languages and possibilities will you have in ten years ;)

MySQL VARCHAR vs TEXT for various tables of user inputs

All,
I'm writing a web app that will receive user generated text content. Some of those inputs will be a few words, some will be several sentence long. In more than 90% of cases, the inputs will be less than 800 characters. Inputs need to be searchable. Inputs will be in various character sets, including Asian. The site and the db are based on utf8.
I understand roughly the tradeoffs between VARCHAR and TEXT. What I am envisioning is to have both a VARCHAR and a TEXT table, and to store inputs on one or the other depending on their size (this should be doable by the PHP script).
What do you think of having several tables for data based on its size? Also, would it make any sense to create several VARCHAR tables for various size ranges? My guess is that I will get a large number of user inputs clustered around a few key sizes.
Thanks,
JDelage

Storing values in one column vs another depending on size of input is going to add a heck of a lot more complexity to the application than it'll be worth.
As for VARCHAR vs TEXT in MySQL, here's a good discussion about that, MySQL: Large VARCHAR vs TEXT.
The "tricky" part is doing a full-text search on this field which requires the use of MyISAM storage engine as it's the only one that supports full-text indexes. Also of note is that sometimes at the cost of complicating the system architecture, it might be worthwhile to use something like Apache Solr as it perform full-text search much more efficiently. A lot of people have most of the data in their MySQL database and use something like Solr just for full-text indexing that text column and later doing fancy searches with that index.
Re: Unicode. I've used Solr for full-text indexing of text with Unicode characters just fine.

Comments are correct. You are only adding 1 byte by using the TEXT datatype over VARCHAR.
Storage Requirements:
VARCHAR Length of string + 1 byte
TEXT Length of string + 2 bytes

The way I see it is you have two options:
Hold it in TEXT, it will waste single additional byte on storage and additional X processing power on search.
Hold it in VARCHAR, create additional table named A_LOT_OF_TEXT with the structure of (int row_id_of_varchar_table, TEXT). If the data is small enough, put it in varchar, otherwise put a predefined value instead of data, for example 'THE_DATA_YOU_ARE_LOOKING_FOR_IS_IN_TABLE_NAMED_A_LOT_OF_TEXT' or just simply NULL and put the real data to table A_LOT_OF_TEXT.

storing gziped strings in mysql

I'm referencing this Store GZIP:ed text in mysql?.
I want to store serialized sessions in the database (they are actually stored in a memcached pool but i have this as a failsafe). I am gziping/uncompressing from php.
I want to ask the following:
1) Is this a good move? I am doing this to avoid using mediumtext as the data may be bigger than text. I think/hope i will have a lot of sessions stored there. Is it, in this case, worth to gzip? Table is MyISAM.
2) Do i need to set the encoding of the table field to binary? Or only do that if i have a complete gziped file?
3) Is serializing a bad move, should i use json_encode instead (because of the smaller size i guess)?
Thanks,

You should use a MEDIUMBLOB field instead of MEDIUMTEXT. BLOBs have no encoding, as they are raw byte streams.

Large amounts of text - mysql or flatfile?

I am writing a web application in PHP that will store large numbers of blocks of arbitrary length text. Is MySQL well suited for this task with a longtext field or similar, or should I store each block of text in its own file and use a MySQL table for indexes and filenames? Think online bulletin board type stuff, like how you would store each users posts.

Yes, MySQL is the way to go. A flat file would take much longer to search etc.
Mysql all the way. Much more efficient.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.