Large amounts of text - mysql or flatfile? - php

I am writing a web application in PHP that will store large numbers of blocks of arbitrary-length text. Is MySQL well suited for this task with a longtext field or similar, or should I store each block of text in its own file and use a MySQL table for indexes and filenames? Think online bulletin board type stuff, like how you would store each user's posts.

Yes, MySQL is the way to go. A flat file would take much longer to search, etc.
MySQL all the way. Much more efficient.

Related

Storing 6 billion floats for easy access in files

I need to save 250 data files an hour, each with 36000 small arrays of [date, float, float, float], in Python, and I need to be able to read them somewhat easily with PHP. This needs to run for 10 years minimum, on 6 TB of storage.
What is the best way to save these individual files? I am thinking of Python's struct, but it starts to look like a poor fit once the data volume gets large.
example of data
a = [["2016:04:03 20:30:00", 3.423, 2.123, -23.243], ["2016:23:.....], ......]
Edit:
Space is more important than unpacking speed and computation, since storage is the limiting factor.
So you have 250 data providers of some kind, which are providing 10 samples per second of (float, float, float).
Since you didn't specify your exact limitations, there are several options.
Binary files
You could write each file as a fixed array of 3*36000 floats with struct; at 4 bytes each that gets you 432,000 bytes per file. You can encode the hour in the directory name and the data provider's id in the file name.
If your data isn't too random, a decent compression algorithm should shave off enough bytes, but you would probably need some sort of delayed compression if you don't want to lose data.
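For the PHP reading side mentioned in the question, here is a minimal sketch of consuming one such hour file, assuming the writer packed each record as three 4-byte little-endian floats (12 bytes per record); the path layout and function name are only illustrative:
<?php
// Read one hour file of 36000 records, each three little-endian 32-bit floats
// (assumes the Python side wrote them with something like struct.pack('<3f', ...)).
function readHourFile(string $path): array
{
    $records = [];
    $fh = fopen($path, 'rb');
    while (strlen($chunk = (string) fread($fh, 12)) === 12) {
        $records[] = array_values(unpack('g3', $chunk)); // 'g' = little-endian float
    }
    fclose($fh);
    return $records;
}
// Hypothetical layout: hour in the directory name, provider id in the file name.
$samples = readHourFile('2016-04-03T20/provider_042.bin');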
numpy
An alternative to packing with struct is numpy.tofile, which stores the array directly to a file. It is fast, but always stores the data in C format, so you should take care if the endianness of the target machine is different. With numpy.savez_compressed you can store a number of arrays in one npz archive and compress them at the same time.
JSON, XML, CSV
Any of the mentioned formats is a good option. Also worth mentioning is the JSON Lines format, where each line is a JSON-encoded record. This enables streaming writes, because the file remains a valid document after every appended line.
They are simple to read, and the syntactic overhead goes away with compression. Just don't build the output with string concatenation; use a real serializer library.
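As a rough illustration of the JSON Lines idea from PHP (the field names are made up for the example):
<?php
// Append one JSON-encoded record per line; the file stays valid after every write.
$record = ['ts' => '2016-04-03 20:30:00', 'values' => [3.423, 2.123, -23.243]];
file_put_contents('provider_042.jsonl', json_encode($record) . "\n", FILE_APPEND);
// Reading it back is one json_decode() per line.
foreach (file('provider_042.jsonl', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $row = json_decode($line, true);
}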
(SQL) Database
Seriously, why not use a real database?
Obviously you will need to do something with the data. At 10 samples per second no human will ever look at the raw stream, so you will have to do aggregations: minimum, maximum, average, sum, etc. Databases already have all of this built in, and in combination with their other features they can save you a ton of time that you would otherwise spend writing scripts and abstractions over files. Not to mention how cumbersome the file management itself becomes.
Databases are extensible and supported by many languages. You save a datetime into the database with Python and read it back as a datetime with PHP. No hassle over how you are going to encode your data.
Databases support indexes for faster lookup.
My personal favourite is PostgreSQL, which has a number of nice features. It supports the BRIN index, a lightweight index that is perfect for huge datasets with naturally ordered fields, such as timestamps. If you're low on disk space, you can extend it with cstore_fdw, a column-oriented data store that supports compression. And if you still want to use flat files, you can write a foreign data wrapper (also possible in Python) and still use SQL to access the data.
Unless you're consuming the files in the same language, avoid language-specific formats and structures. Always.
If you're going between 2 or more languages, use a common, plain text data format like JSON or XML that can be easily (often natively) parsed by most languages and tools.
If you follow this advice and you're storing plain text, then use compression on the stored file; that's how you conserve space. Typical well-structured JSON tends to compress really well (assuming simple text content).
Once again, choose a compression format like gzip that's widely supported by languages or their core libraries. PHP, for example, has a native gzopen() function, and Python has the gzip module in its standard library.
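For example, a sketch (file names made up) of writing and reading gzip-compressed JSON Lines from PHP:
<?php
// Write gzip-compressed lines; compression level 9 favours space over speed.
$gz = gzopen('provider_042.jsonl.gz', 'wb9');
gzwrite($gz, json_encode(['ts' => '2016-04-03 20:30:00', 'values' => [3.423, 2.123, -23.243]]) . "\n");
gzclose($gz);
// gzfile() transparently decompresses and returns the lines.
foreach (gzfile('provider_042.jsonl.gz') as $line) {
    $row = json_decode($line, true);
}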
I doubt it is possible without extremely efficient compression.
6 TB / 10 years / 365 days / 24 hours / 250 files ≈ 270 KB per file.
That is the ideal case; in the real world the filesystem cluster size also matters.
If you have 36,000 “small arrays” to fit into each file, you have only about 7 bytes per array, which is not enough to store even a proper datetime object alone.
One idea that comes to mind if you want to save space: store only the values and discard the timestamps. Produce files that contain only data, and build a kind of index (a formula) that, given a timestamp (year/month/day/hour/min/sec...), yields the position of the data inside the file (and, of course, which file to look in). If you look closely you will also find that, with a "smart" naming scheme for the files, you can avoid storing the year/month/day/hour information at all, since part of the index can be the file name itself. It all depends on how you implement your "index" system, but pushed to the extreme you could forget about timestamps and focus only on the data.
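A sketch of such an index formula in PHP, assuming one file per provider per hour, 10 samples per second and 12 bytes (three 4-byte little-endian floats) per sample; the names and path scheme are hypothetical:
<?php
// Map a timestamp to (file, byte offset) without storing any timestamps in the data.
function sampleLocation(int $providerId, DateTimeImmutable $ts, int $subSample = 0): array
{
    $file   = sprintf('data/%s/provider_%03d.bin', $ts->format('Y-m-d\TH'), $providerId);
    $second = (int) $ts->format('i') * 60 + (int) $ts->format('s');
    return [$file, ($second * 10 + $subSample) * 12]; // 12 bytes per sample
}
[$file, $offset] = sampleLocation(42, new DateTimeImmutable('2016-04-03 20:30:00'));
$fh = fopen($file, 'rb');
fseek($fh, $offset);
$values = array_values(unpack('g3', fread($fh, 12))); // the three floats
fclose($fh);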
Regarding the data format, as mentioned above, I would definitely go with a language-independent format such as XML or JSON... Who knows which languages and possibilities you will have in ten years ;)

Raw Text as a Substitute for MySQL

I am renting a server which does not support MySQL. An upgrade would be significantly expensive.
So for the moment, I am trying to cope with using raw text.
So here is the database file format I am thinking about:
{first line} metadata (id, name, date, number of rows, number of columns, etc.)
{second line} column headers
{rest of lines} column data, separated by a delimiter
Example (using * as the delimiter):
rmC2xA7f*Users*1436703535*3*5
id*first*last*email*password
d29JHVca*Example*User*example.user@example.com*examplepassword123
tGpy3CM6*Foo*Bar*foo.bar@foobar.com*foobarpassword456
PdQMDHsK*Bla*Bla*bla.bla@bla.com*blablapassword789
I would then create a PHP library for manipulating this text. I know that it wouldn't be as efficient, scalable or fast as MySQL, but would this be an acceptable substitute for a small, personal website?
Are there any issues with it, or any way I could improve it? I'll probably change the * to something else if you're thinking that.
Also, comment if this question should be on a different network...
Thanks :).
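For what it's worth, a minimal sketch of how such a PHP library might parse one of these files, assuming the three-part layout above and '*' as the delimiter (the function name is made up):
<?php
// Parse: line 1 = metadata, line 2 = column headers, remaining lines = rows.
function loadTable(string $path): array
{
    $lines   = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $meta    = explode('*', array_shift($lines)); // id, name, date, rows, columns
    $headers = explode('*', array_shift($lines));
    $rows    = [];
    foreach ($lines as $line) {
        $rows[] = array_combine($headers, explode('*', $line));
    }
    return ['meta' => $meta, 'headers' => $headers, 'rows' => $rows];
}
$users = loadTable('users.txt');
// $users['rows'][0]['first'] === 'Example'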

Compacting string data for storage and retrieval

I have some text data I would like to store in a mysql database. I currently have the data stored in a variable as a string.
I'm concerned that the table will become quite large due to the amount of text data I have for each row.
Therefore, what is the easiest way (preferably with PHP built-in functions) to compact this string data into a format suitable for storage and retrieval?
You could gzip the string with gzencode().
That's pretty standard and thus should be reversible from other languages if you want to.
I would advise storing a Base64 version of the result.
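A minimal sketch of that round trip, assuming the column is a plain text type (the variable names are illustrative):
<?php
// Compress and make the result text-safe before the INSERT ...
$stored = base64_encode(gzencode($longText, 9));
// ... and reverse it after the SELECT.
$original = gzdecode(base64_decode($stored));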
If you're using InnoDB you can enable compression on entire tables which doesn't impact your code at all.
ALTER TABLE database.tableName ENGINE='InnoDB' ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
You can alter the KEY_BLOCK_SIZE to smaller values to get more compression (depending on the data), but this adds more overhead to the CPU.
After testing a range of tables, I found a KEY_BLOCK_SIZE of 8 to be a good balance of compression vs performance.

Storing text in db: how to choose varchar size (considering formatting), storing formatting separately?

How do you best choose a size for a varchar/text/... column in a (MySQL) database (let's assume the text the user can type into a text area should be max 500 chars), considering that the user might also use formatting (HTML/BB code/...), which is not visible to the user and should not count toward the max 500 char text size?
1) Theoretically, to prevent any error, the varchar size would have to be almost endless, e.g. if the user uses 20 links like this (http://[huge number of chars]) or whatever... or not?
2) Should/could you save the formatting in a separate column, e.g. so as not to give an index (like FULLTEXT) wrong values (words that are contained in the formatting but not in the real text)?
If yes, how is this best done? Do you remember at which position the formatting was used, save that position and the formatting, and put the information back together when outputting?
(PHP/MySQL, JavaScript, jQuery)
Thank you very much in advance!
A good solution is to account for the formatting characters in the column size.
If you do not, then to avoid data loss you need to reserve much more space for the text in the database and check the length of the record before saving, or use a TEXT column.
Keeping the same data twice in one table is not a good solution. It all depends on your project, but it is usually better to filter the formatting out in PHP.
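A minimal sketch of the "filter the formatting in PHP" idea, assuming HTML-style markup and the 500-character limit from the question (variable names are illustrative):
<?php
// Count only the visible text, not the markup, against the 500-char limit.
$input   = $_POST['comment'] ?? '';
$visible = strip_tags($input);               // drop the formatting before counting
if (mb_strlen($visible, 'UTF-8') > 500) {
    exit('Text is longer than 500 characters.'); // reject (or truncate) before saving
}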

MySQL VARCHAR vs TEXT for various tables of user inputs

All,
I'm writing a web app that will receive user-generated text content. Some of those inputs will be a few words, some will be several sentences long. In more than 90% of cases, the inputs will be less than 800 characters. Inputs need to be searchable. Inputs will be in various character sets, including Asian. The site and the db are based on utf8.
I understand roughly the tradeoffs between VARCHAR and TEXT. What I am envisioning is to have both a VARCHAR and a TEXT table, and to store inputs on one or the other depending on their size (this should be doable by the PHP script).
What do you think of having several tables for data based on its size? Also, would it make any sense to create several VARCHAR tables for various size ranges? My guess is that I will get a large number of user inputs clustered around a few key sizes.
Thanks,
JDelage
Storing values in one column vs another depending on size of input is going to add a heck of a lot more complexity to the application than it'll be worth.
As for VARCHAR vs TEXT in MySQL, here's a good discussion about that: MySQL: Large VARCHAR vs TEXT.
The "tricky" part is doing a full-text search on this field which requires the use of MyISAM storage engine as it's the only one that supports full-text indexes. Also of note is that sometimes at the cost of complicating the system architecture, it might be worthwhile to use something like Apache Solr as it perform full-text search much more efficiently. A lot of people have most of the data in their MySQL database and use something like Solr just for full-text indexing that text column and later doing fancy searches with that index.
Re: Unicode. I've used Solr for full-text indexing of text with Unicode characters just fine.
Comments are correct. You are only adding 1 byte by using the TEXT datatype over VARCHAR.
Storage requirements:
VARCHAR: length of string + 1 byte
TEXT: length of string + 2 bytes
The way I see it is you have two options:
Hold it in TEXT; it will waste a single additional byte of storage and some additional processing power on search.
Hold it in VARCHAR and create an additional table named A_LOT_OF_TEXT with the structure (int row_id_of_varchar_table, TEXT). If the data is small enough, put it in the VARCHAR column; otherwise store a predefined marker instead of the data, for example 'THE_DATA_YOU_ARE_LOOKING_FOR_IS_IN_TABLE_NAMED_A_LOT_OF_TEXT' or simply NULL, and put the real data into A_LOT_OF_TEXT.
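A rough sketch of that second option using PDO and the question's 800-character threshold; the table and column names are only illustrative, and $pdo is assumed to be an existing connection:
<?php
// Small inputs go in the VARCHAR column; large ones overflow into A_LOT_OF_TEXT.
$text = $_POST['content'] ?? '';
if (mb_strlen($text, 'UTF-8') <= 800) {
    $pdo->prepare('INSERT INTO inputs (content) VALUES (?)')->execute([$text]);
} else {
    $pdo->prepare('INSERT INTO inputs (content) VALUES (NULL)')->execute();
    $pdo->prepare('INSERT INTO A_LOT_OF_TEXT (row_id_of_varchar_table, body) VALUES (?, ?)')
        ->execute([$pdo->lastInsertId(), $text]);
}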
