storing large data in files vs tables

So I am working on this website where people can post articles. My colleague suggested storing all the metadata of an article (user, title, dates etc.) in a table and the actual article body as a file on the server.
A data structure would look like:
post_id  post_user_id  post_title     post_body  post_date   etc
-----------------------------------------------------------------
1        1             My First Post  1_1.txt    2014-07-07  ...
2        1             My First Post  2_1.txt    2014-07-07  ...
-----------------------------------------------------------------
Now we would get the record of the post and then locate where it is by
$post_id . "_" . $post_user_id . ".txt";
He says that this will reduce the size of the tables and in the long run make them faster to access. I am not sure about this and wanted to ask if there are any issues with this design.

The first risk that pops into my mind is data corruption. Following the design, you are splitting the information into two fragments, even though both pieces are dependent on one another:
A file has to exist for each metadata entry (or you'll end up with not found errors for entries supposed to exist).
A metadata entry has to exist for each file (or you'll end up with garbage).
Using only the database has one big advantage: it is very probably relational. This means that you can actually set up rules to prevent the two scenarios above from occurring (you could use an SQL ON DELETE CASCADE rule for instance, or put every piece of information in one table). Keeping these relations between two data backends is going to be a tricky thing to set up.
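As a rough sketch of the kind of rule mentioned above: the snippet below assumes a hypothetical split into a posts table and a post_bodies table, and uses an in-memory SQLite database through PDO as a stand-in for MySQL so that it runs on its own. The table and column names are illustrative only, not the asker's actual schema.

```php
<?php
// Sketch only: SQLite in memory stands in for MySQL; table and column
// names are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('PRAGMA foreign_keys = ON'); // SQLite needs this to enforce foreign keys

$db->exec('CREATE TABLE posts (
    post_id      INTEGER PRIMARY KEY,
    post_user_id INTEGER NOT NULL,
    post_title   TEXT NOT NULL,
    post_date    TEXT NOT NULL
)');
$db->exec('CREATE TABLE post_bodies (
    post_id   INTEGER PRIMARY KEY
              REFERENCES posts(post_id) ON DELETE CASCADE,
    post_body TEXT NOT NULL
)');

$db->exec("INSERT INTO posts VALUES (1, 1, 'My First Post', '2014-07-07')");
$db->exec("INSERT INTO post_bodies VALUES (1, 'The article text...')");

// A body without a matching post is rejected by the foreign key.
$orphanRejected = false;
try {
    $db->exec("INSERT INTO post_bodies VALUES (99, 'orphan')");
} catch (PDOException $e) {
    $orphanRejected = true;
}

// Deleting the post removes its body as well.
$db->exec('DELETE FROM posts WHERE post_id = 1');
$remainingBodies = (int) $db->query('SELECT COUNT(*) FROM post_bodies')->fetchColumn();
```

Neither guarantee is available when the body lives in a separate file: the application would have to check for orphans on both sides itself.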
Another important thing to remember: data stored in an SQL database isn't sent to a magical place far from your drive. When you add an entry to your database, you write to your database files. For instance, MySQL typically stores those files under /var/lib/mysql on Linux. Writing to other files does not make that much of a difference...
Next thing: time. Accessing a database is fast once it's opened; all it takes is query processing. Accessing files (and that is once per article) may be heavier: files need to be opened (including privilege checks, ...), read (in chunks, according to your buffer size) and closed. Of course, you can add to that all the programming it would take to link those files to their metadata...
To me, this design adds unnecessary complexity to the application. You could store everything in the database and centralise. You'll use pretty much the same amount of disk space in both cases, yet looking up and accessing each article file separately (while keeping it connected with its DB metadata) will certainly waste some time.
Design for simplicity; add complexity only where you must. (Eric S. Raymond)

This could look like a good idea if those posts are NEVER edited. Access to a file could take a while, and if your users want to edit their posts many times, storing the content in a file is not a great idea. SQL databases handle large text values well (such as WYSIWYG text), so don't be afraid to store them in your Post table.
Additionally, reading and writing data stored in separate files will often take more time than going through the database.
Everything will depend on the number of posts you want to store, and on whether your users can edit their posts.

I would agree, in a production environment it is generally recommended to let the file system keep track of files and the database to hold on to the metadata.
However, I have mostly heard this philosophy applied to BLOB types and images. Because even large articles are relatively small, a TEXT data type can suffice and even make it easier to edit, draw from, and search as needed.
(hence I agree with Rémi Delhaye, who answered this just as I was writing this post)

A filesystem is much more likely to have higher latency, and files can 'go missing' where a database record is less likely to.
If the contents of the field are too large, then in the case of SQL Server you could look at the FileStream API in newer versions.
Really though, either approach is valid in my opinion. With a file you don't have to worry about the database mangling the content if you make a mistake during escaping or something.
Beware if you're writing your code on a case-insensitive filesystem and running on a case-sensitive one in production: filename case matters, so it can be another way to lose access to your files later on, or unexpectedly once the application is deployed.

Related

XML or SQL - Only Reading data frequently

I have a PHP function that returns 'true' or 'false' based on whether a particular piece of data exists in a database/XML file.
This search query will be performed on each page load of the website.
Which option will be better: an SQL DB or data stored in XML?
Which will put less load on the server and affect performance less, given the high number of page loads/visits?
I don't need to modify or insert data; it's just searching.
Thank you
------- update --------
Another way I was thinking of is to store all the information (strings) in the form of file names [e.g. a directory having 1000 files with different names].
Now I can search for a particular file (e.g. 457.txt) in the directory to see if that file exists.
<?php
if (file_exists('457.txt')) {
    echo "file exists";
}
How much load will it put on the server, compared to an XML query and an SQL query?
The main aim is to see if a particular piece of data already exists on the server.

If you have reason to be concerned about user/system load, I would argue against the XML file solution but not for the reason you think of. It's more about concurrency. What would happen if several processes were to update the file at the same time ? Writing to a file raises concerns about locking issues you might encounter soon enough.
The file name idea is interesting but will probably scale badly, and not allow you to perform much more than single item existence search with acceptable performance, unless you dive deep into OS, FS and storage layer intricacies. It all depends on the volume of data you intend to store and how you intend to access it.
All in all, the database is the most versatile storage solution for this kind of need, and allows you to perform queries using an easy, explicit language, among many other advantages (ACID transactions [ https://en.wikipedia.org/wiki/ACID ], scalability...).
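For comparison with the file_exists() idea, here is a minimal sketch of the same existence check done against a database. It assumes a hypothetical known_items table and an itemExists() helper, neither of which comes from the question, and uses in-memory SQLite via PDO in place of a real server so that it is self-contained.

```php
<?php
// Sketch only: SQLite in memory stands in for a real database server;
// the table name known_items and the function name are hypothetical.
function itemExists(PDO $db, string $name): bool
{
    // The primary key gives the database an index to search,
    // so it does not have to scan every row.
    $stmt = $db->prepare('SELECT 1 FROM known_items WHERE name = ? LIMIT 1');
    $stmt->execute([$name]);
    return $stmt->fetchColumn() !== false;
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE known_items (name TEXT PRIMARY KEY)');

$insert = $db->prepare('INSERT INTO known_items (name) VALUES (?)');
foreach (['123', '457', '999'] as $name) {
    $insert->execute([$name]);
}

$found   = itemExists($db, '457');
$missing = itemExists($db, '458');
```

The prepared statement also keeps the looked-up value out of the SQL text, which matters if the value ever comes from a request.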

Reading a file or searching in a database?

I am creating a web-based app for Android and I came to the point of the account system. Previously I stored all data for a person inside a text file, located at users/<name>.txt. Now thinking about doing it in a database (like you probably should), wouldn't that take longer to load since it has to look for the row where the name is equal to the input?
So, my question is, is it faster to read data from a text file, easy to open because it knows its location, or would it be faster to get the information from a database, although it would have to first scan line by line until it reaches the one with the correct name?
I don't care about the safety, I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn

In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage to a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data that might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.

A bit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as
operating systems maintain a sort of index. However, the contents of a
txt file won't be indexed, which is one of the main advantages of a
database. Another is understanding the relational model, so that data
doesn't need to be repeated over and over. Another is understanding
types. If you have a txt file, you'll need to parse numbers, dates,
etc.
So - the file system might work for you in some cases, but certainly
not all.

That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)

It is quite a simple solution: use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A write to the text file can fail, and you will lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
EDIT:
Also, a big question - is this about server side or app side??
Because on the app side, realistically you won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.

Is it faster or better to use MySQL instead of text files or file names for order of images with PHP?

I have images being stored in folders related to articles on my PHP web site, and would like to set the order to display the images based on author input. I started by naming the files with a number in front of them, but was considering recording the order in a text file instead to avoid renaming every file and retaining their original file names, or possibly storing the order in a MySQL table.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them. I was hoping to get some input about which method would be best and fastest.
For example, is it much slower to read a list of file names in a folder with PHP, or open a text file and read the contents, compared to making MySQL query and update statements?

I'd say a lot depends on the performance of your underlying hardware, filesystem and MySQL connection. A single disk access just to read the images is most likely going to be your quickest option, but you'd need to name your files manually ahead of time.
MySQL requires a TCP or Unix socket connection, and this might slow things down (a lot depends on the number of pictures you have and the "quality" of your DB link, though). If you have a lot of files, the performance hit might be negligible. Just reading from a file might be quicker nevertheless, without bothering to set up a DB connection; you'd still have to write down the ID/filename correspondence for the ordering, though.
Something I'd try in your situation is PHP's stat() function, to see if it can help you sort the pictures. Depending on the number of pictures you have (it works better with lower numbers), you might not take a serious performance hit, and you'd avoid keeping a separate list of picture/creation date tuples. As your number of pictures grows, the file list approach seems to me like a reasonable way to solve the problem. Only benchmarking as the number of pictures increases can tell you the truth, though, since I think you can expect a lot of variability depending on your specific context.

If your concern is performance, why not save the list (maybe already formatted as HTML) to a file? When your page is loaded, just read the file with
$code = file_get_contents("cache_file.html");
and output it to the user. The fastest solution is to store the file as .html and let Apache serve it directly, but this works only if your page doesn't have any other dynamic part.
To ensure that your cache file is up to date, you can invalidate it and recreate it after some time (the specific time depends on how frequently the images change), or check whether the directory has changed since the cache file was created. If you control the changes to the image directory (for example, if they are made by a piece of code that you wrote), you can always ensure that your cache is refreshed when the images are changed.
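One possible sketch of the "check whether the directory has changed" variant follows. The function name, the paths and the generated HTML are all made up for illustration. It relies on the directory's modification time, which changes when files are added, removed or renamed but not when an existing image is overwritten in place, and which usually has one-second resolution, so treat it as a starting point only.

```php
<?php
// Sketch only: function name, paths and HTML are hypothetical.
function cachedImageList(string $imageDir, string $cacheFile): string
{
    clearstatcache(); // avoid stale mtime results within one request
    $cacheIsFresh = is_file($cacheFile)
        && filemtime($cacheFile) >= filemtime($imageDir);

    if ($cacheIsFresh) {
        return file_get_contents($cacheFile);
    }

    // Rebuild: list the directory, sort by name, render once.
    $files = array_values(array_diff(scandir($imageDir), ['.', '..']));
    sort($files, SORT_NATURAL);
    $html = '';
    foreach ($files as $file) {
        $html .= '<img src="' . htmlspecialchars($file) . '">' . "\n";
    }
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;
}

// Demo in a temporary directory; the cache file lives outside the image
// directory so that writing it does not touch the directory's mtime.
$dir = sys_get_temp_dir() . '/img_' . uniqid();
mkdir($dir);
foreach (['2.png', '10.png', '1.png'] as $f) {
    touch("$dir/$f");
}
$cache  = $dir . '.cache.html';
$first  = cachedImageList($dir, $cache); // builds the cache
$second = cachedImageList($dir, $cache); // served from the cache file
```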
Hope this helps

This smells like premature optimization.
My question is about best practice and speed - every time the article is loaded, it will have to find out the order of images to display them.
So? A query like "select filename, title from images where articleId=$articleId order by `order`" will execute within a fraction of a second. It really doesn't matter. Do whatever is easiest, and might I suggest that is the SQL option.

IMHO, using MySQL would be slower, but oh so much easier. If the MySQL server is hosted on the same server (or within dedicated space on the same server, like cloud linux), then avoiding it probably won't save too much time.
Edit:
If you want to do a test, you can use the microtime function to time exactly how long it takes to append and sort the files, and how long it takes to get it all from MySQL.
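A small sketch of such a timing test, assuming a hypothetical timeIt() helper and a throwaway directory of empty files. Only the filesystem side is shown; a closure that runs the MySQL query could be passed to the same helper for the comparison.

```php
<?php
// Sketch only: the helper name and the sample workload are hypothetical.
function timeIt(callable $work, int $runs = 100): float
{
    $start = microtime(true); // current time in seconds, as a float
    for ($i = 0; $i < $runs; $i++) {
        $work();
    }
    return (microtime(true) - $start) / $runs; // average seconds per run
}

$dir = sys_get_temp_dir() . '/bench_' . uniqid();
mkdir($dir);
foreach (['b.jpg', 'a.jpg', 'c.jpg'] as $f) {
    touch("$dir/$f");
}

// Workload 1: read the file names from disk and sort them.
$listFromDisk = function () use ($dir) {
    $files = array_diff(scandir($dir), ['.', '..']);
    sort($files);
    return $files;
};

$perRun = timeIt($listFromDisk);
printf("directory listing: %.6f s per run\n", $perRun);
```

Averaging over many runs smooths out noise, but results will still vary with disk caches and server load, so it is worth repeating with a realistic number of files.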

If I have the choice, should webpage contents be saved on the file system or in MySQL?

I am in the planning stages of writing a CMS for my company. I find myself having to make the choice between saving page contents in a database or in folders on a file system. I have learned that PHP performs admirably well reading and writing to file systems, way better in fact than running SQL queries. But when it comes to saving pages and their data on a file system, there'll be a lot more involved than just reading and writing. Since pages will be drawn using a PHP class, the data for each page will be just data, no HTML. Therefore a parser for the files would have to be written. Also I doubt that all the data from a page will be saved in just one file, it would rather be saved in one directory, with content boxes and data in separated files.
All this would be done so much easier with MySQL, so what I want to ask you experts:
Will all the extra dilly dally with file system saving outweigh its speed and resource advantage over MySQL?
Thanks for your time.

Go for MySQL. I'd say the only time you should think about using the file system is when you are storing files (BLOBs) of several megabytes; databases (at least the ones you typically use with a PHP website) are generally less performant when storing that kind of data. For the rest I'd say: always use a relational database. (Assuming you are dealing with data that has relations, of course; if it is random data there is not much benefit in using a relational database ;-)
Addition: If you define your own file-structure, and even your own way of cross referencing files you've already started building a 'database' yourself, that is not bad in itself -- it might be loads of fun! -- but you probably will not get the performance benefits you're looking for unless your situation is radically different than the other 80% of 'standard' websites on the web (a couple of pages with text and images on them). (If you are building google/youtube/flickr/facebook ... you've got a different situation and developing your own unique storage solution starts making sense)

Things to consider:
Race conditions on file writes if two users edit the same piece of content
Distributing files across multiple servers if the CMS grows; replication latency will cause data integrity problems
Search performance: grep over files in multiple directories will be very slow
Too many files in the same directory will degrade server performance, especially on Windows

Assuming you have a low-traffic, single-server environment here…
If you expect to ever have to manage those entries outside of the CMS, my opinion is that it's much, much easier to do so with plain files and existing text tools than with database access tools.
For example, there's huge value in being able to use awk, grep, sed, sort, uniq, etc. on textual data. Proxying that through a database makes this hard but not impossible.
Of course, this is just opinion based on experience.
S

Storing Data on the filesystem may be faster for large blobs that are always accessed as one piece of information. When implementing a CMS, you typically don't only have to deal with such blobs but also with structured information that has internal references (like content fields belonging to a certain page that has links to other pages...). SQL-Databases provide an easy way to access structured information, files on your filesystem do not (except of course simple hierarchical structures that can be represented with folders).
So if you wanted to store the structured data of your cms in files, you'd have to use a file format that allows you to save the internal references of your data, e.g. XML. But that means that you would have to parse those files, which is not only a lot of work but also makes the process of accessing the data slow again.
In short, use MySQL

Use a database and you have lots of important properties from the beginning "for free" without inventing them in some suboptimal ways if you go the filesystem way. If you don't want to be constrained to MySQL only you can make use of e.g. the database abstraction layer of the doctrine project.
Additionally you have tools like phpMyAdmin for easy lookup or manipulation of your data versus the texteditor.
Keep in mind that the result of your database queries can almost always be cached in memory or even in the filesystem so you have the benefit of easier management with well known tools and similar performance.

When it comes to minor modifications of website contents (e.g. fixing a typo or updating external links), I find it much easier to connect to the server using SSH and use various tools (text editors, grep etc.) on files, rather than having to use the CMS interface to update each file manually (our CMS has such an interface).
Yet there are several questions to analyze and answer, mentioned above - do you plan for scalability, concurrent modification of data etc.

No, it will not be worth it.
And there is no advantage to using the filesystem over a database unless you are the only user on the system (in which case the advantage would be lost anyway). As soon as the transactions start rolling in and updates cascade to multiple pages and multiple files, you will regret that you didn't use the database from the beginning :)
If you are set on using caching, experiment with some of the existing frameworks first. You will learn a lot from it. Maybe you can steal an idea or two for your CMS?

PHP to store images in MySQL or not?

I have built a small web application in PHP where users must first log in. Once they have logged in, I intend on showing a small thumbnail as part of their "profile".
I will have to ensure the image is below a particular size to conserve space, or ensure it is a particular resolution, or both, or even perhaps use something like image magick to scale it down.
Not sure what the best approach for that is yet, any ideas welcome.
Also, I have been trying to work out if it is better to store the image in the users table of MySQL as a blob, or maybe a separate images table with a unique id, and just store the appropriate image id in the users table, or simply save the uploaded file on the server (via an upload page as well) and save the file as theUsersUniqueUsername.jpg.
Best option?
I found a tutorial on saving images to mysql here:
http://www.phpriot.com/articles/images-in-mysql
I am only a hobby programmer, and haven't ever done anything like this before, so examples, and/or a lot of detail is greatly appreciated.

It always depends on context, but usually I store a user image on the filesystem in a folder called /content/user/{user_id}.jpg and try to bother the database as little as possible.

I would recommend storing the image as a file and then have the file URI in the database. If you store all the images in the database, you might have some problems with scaling at a later date.
Check out this answer too:
Microsoft's advice for SQL Server used to be, for speed and size, store images in the file system, with links in the database. I think they've softened their preference a bit, but I still consider it a better idea certainly for size, since it will take up no space in the database.

The overhead of using BLOBs is a lot less than most people would have you believe, especially if you set it up right. If you use a separate server just running the DB to store binary files, then you can in fact use no file system at all and avoid any overhead from the file system.
That said, unless you have a couple of servers to yourself, the easiest/best way is storing them in the filesystem.
Do not store the absolute URL of the file in your DB, just the unique part (and possibly a folder or two), e.g. 2009/uniqueImageName.jpg or just uniqueImageName.jpg.
Then in your pages just add the host and other folders onto the front, that way you have some flexibility in moving your images - all you'll need to change is a line or two in your PHP/ASP.NET page.
There is no need to store outside the document root for security - a .htaccess file with DENY FROM ALL will work the same and provide more flexibility
No need to 'shunt' images so much for security, just have a getImage.php page or something, and then instead of inserting the actual URL in the src of the image, use something like getImage.php?file=uniqueImageName.jpg.
Then the getImage.php file can check if the user is authorised and grab the image (or not).
Use a name which is guaranteed to be unique (preferably an integer, i.e. the primary key) when storing. Some file systems (e.g. on Windows) are case-insensitive, so JoeBloggs.jpg and joebloggs.jpg are unique for the database but not for the file system, and one will overwrite the other.
Use a separate table for the images, and store the primary key of the image in the users table. If you ever want to add more fields or make changes in future it will be easier - it's also good practice.
If you are worried about SEO and things like that, store the image's original file name in another field when you are uploading, you can then use this in your output (such as in the alt tag).
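To make the getImage.php?file=... idea a little more concrete, here is a hedged sketch of just the file-lookup part. The resolveImagePath() name and the directory are invented for the example, and the authorisation check itself is left out. It uses basename() so that a request cannot climb out of the image folder.

```php
<?php
// Sketch only: function name and directory layout are hypothetical, and
// the "is this user authorised?" check is deliberately not shown.
function resolveImagePath(string $requested, string $imageDir): ?string
{
    $name = basename($requested); // drops any directory part such as ../
    $path = $imageDir . '/' . $name;
    return ($name !== '' && is_file($path)) ? $path : null;
}

$dir = sys_get_temp_dir() . '/images_' . uniqid();
mkdir($dir);
touch($dir . '/uniqueImageName.jpg');

$ok     = resolveImagePath('uniqueImageName.jpg', $dir); // path inside $dir
$escape = resolveImagePath('../../etc/passwd', $dir);    // no such file in $dir
```

A real script would then send a Content-Type header and stream the file, for example with readfile(), only after the authorisation check passes.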

Challenging the Conventional Wisdom!
Of course it is context dependent, but I have a very large application with thousands of images and documents stored as BLOBs in a MySQL database (average size=2MB) and the application runs fine on a server with 256MB of memory. The secret is correct database structure. Always keep two separate tables, one of which stores the basic information about the file, and the other table should just contain the blob plus a primary key for accessing it. All basic queries will be run against the details table, and the other table is only accessed when the file is actually needed, and it is accessed using an indexed key so performance is extremely good.
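The two-table layout described above might look roughly like this. The table names file_details and file_blobs are placeholders of my own, and in-memory SQLite via PDO is used instead of MySQL purely so that the sketch runs on its own.

```php
<?php
// Sketch only: SQLite in memory stands in for MySQL; names are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Small table with the basic information about each file.
$db->exec('CREATE TABLE file_details (
    file_id   INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL,
    mime_type TEXT NOT NULL,
    byte_size INTEGER NOT NULL
)');
// Separate table holding only the binary content plus its key.
$db->exec('CREATE TABLE file_blobs (
    file_id INTEGER PRIMARY KEY REFERENCES file_details(file_id),
    content BLOB NOT NULL
)');

$bytes = "\x89PNG\r\n\x1a\n fake image bytes";
$db->beginTransaction();
$db->prepare('INSERT INTO file_details (file_name, mime_type, byte_size) VALUES (?, ?, ?)')
   ->execute(['avatar.png', 'image/png', strlen($bytes)]);
$fileId = (int) $db->lastInsertId();
$blobInsert = $db->prepare('INSERT INTO file_blobs (file_id, content) VALUES (?, ?)');
$blobInsert->bindValue(1, $fileId, PDO::PARAM_INT);
$blobInsert->bindValue(2, $bytes, PDO::PARAM_LOB);
$blobInsert->execute();
$db->commit();

// Listing queries touch only the small details table.
$listing = $db->query('SELECT file_id, file_name, byte_size FROM file_details')
              ->fetchAll(PDO::FETCH_ASSOC);

// The blob table is read only when the file itself is needed.
$stmt = $db->prepare('SELECT content FROM file_blobs WHERE file_id = ?');
$stmt->execute([$fileId]);
$content = $stmt->fetchColumn();
```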
The advantages of storing files in the database are multiple:
Much easier backup systems are required, as you do not need to back up the file system
Controlling file security is much easier as you can validate before releasing the binary (yes, you can store the file in a non-public directory and have a script read and regurgitate the file, but performance will not be noticeably faster).
(Similar to #1) It cleanly separates "user content" and "system content", making migrations and cloning easier.
Easier to manage files, track/store version changes, etc, as you need fewer script modifications to add version controls in.
If performance is a big issue and security and backups aren't (or if you have a good FS backup system) then you can store it in the FS, but even then I often store files (in the case of images) in the DB and build a caching script that writes the image to a cache folder after the first time it's used (yes, this uses more HD space, but that is almost never a limiting factor).
Anyway, obviously FS works well in many instances, but I personally find DB management much easier and more flexible, and if written well the performance penalties are extremely small.

We created a shop that stored images in the DB. It worked great during development but once we tested it on the production servers the page load time was far too high, and it added unnecessary load to the DB servers.
While it seems attractive to store binary files in the DB, fetching and manipulating them adds extra complexity that can be avoided by just keeping files on the file system and storing paths / metadata in the DB.
This is one of those eternal debates, with excellent arguments on both sides, but for my money I would keep images away from the DB.

I recently saw this list of tips: http://www.ajaxline.com/32-tips-to-speed-up-your-mysql-queries
Tip 17:
For your web application, images and other binary assets should normally be stored as files. That is, store only a reference to the file rather than the file itself in the database.
So just save the file path to the image :)

I have implemented both solutions (file system and database-persisted images) in previous projects. In my opinion, you should store images in your database. Here's why:
File system storage is more complicated when your app servers are clustered. You have to have shared storage. Even if your current environment is not clustered, this makes it more difficult to scale up when you need to.
You should be using a CDN for your static content anyways, and set your app up as the origin. This means that your app will only be hit once for a given image, then it will be cached on the CDN. CloudFront is dirt cheap and simple to set up...there's no reason not to use it. Save your bandwidth for your dynamic content.
It's much quicker (and thus cheaper) to develop database-persisted images.
You get referential integrity with database persisted images. If you're storing images on the file system, you will inevitably have orphan files with no matching database records, or you'll have database records with broken file links. This WILL happen...it's just a matter of time. You'll have to write something to clean these up.
Anyways, my two cents.

What's the blob datatype for anyway, if not for storing files?
If your application involves authorisation prior to accessing the files, the chances are that you're a) storing the files outside of DOCUMENT_ROOT (so they can't be accessed directly), and possibly b) sending the entire contents of the files through the application (of course, maybe you're doing some sort of temporarily-move-to-hashed-but-publicly-accessible-filename thing). So the memory overhead is there anyway, and you might as well be retrieving the data from the database.
If you must store files in a filesystem, do as Andreas suggested above, and name them using something you already know (i.e. the primary key of the relevant table).

I think that most database engines are so advanced already that storing BLOBs of data does not produce any disadvantages (bloated DB etc.). One advantage is that you don't have any broken links when the image is in the database already. That being said, I have myself always stored the file on disk and given the URI to the database. It depends on the usage. It may be easier to handle img-in-db if the page is very dynamic and changes often: no namespace problems. I have to say that it comes down to what you prefer.

I would suggest you do not store the image in your db. Instead since every user will be having a unique id associated with his/her profile in the db, use that id to store the image physically on the server.
e.g. if a user has id 23, you can store an image in www.yourname.com/users/profile_images/23.jpg. Then to display, you can check if the image exists, and display it accordingly else display your generic icon.
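The "check if the image exists, else show a generic icon" step could be sketched like this. The profileImageUrl() function, the directory and the default.jpg name are assumptions made for the example.

```php
<?php
// Sketch only: function name, directory layout and default file are hypothetical.
function profileImageUrl(int $userId, string $imageDir, string $baseUrl): string
{
    $file = $userId . '.jpg'; // built from an integer, so no path tricks
    if (is_file($imageDir . '/' . $file)) {
        return $baseUrl . '/' . $file;
    }
    return $baseUrl . '/default.jpg'; // generic icon for users without a picture
}

$dir = sys_get_temp_dir() . '/profile_images_' . uniqid();
mkdir($dir);
touch($dir . '/23.jpg');

$withPicture    = profileImageUrl(23, $dir, '/users/profile_images');
$withoutPicture = profileImageUrl(24, $dir, '/users/profile_images');
```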

As the others suggested:
Store the images in the filesystem
Do not bother to store the filename, just use the user id (or anything else that "you already know")
Put static data on a different server (even if you just use "static.yourdomain.com" as an alias to your normal server)
Why ?
The bigger your database gets the slower it will get.
Storing your image in the database will increase your database size.
Storing the filename will increase your database size.
Putting static data on a different server (alias):
Makes scaling up a lot easier
Browsers limit the number of parallel connections to one host (historically two under HTTP/1.1, around six in more recent browsers), so by putting static data on a "second" server you speed up the loading

After researching for days, I made a system storing images and binaries on the database.
It was just great. I have now 100% control over the files, like access control, image sizing (I don't scale the images dynamically, of course), statistics, backup and maintenance.
In my speed tests, the system is now 10x slower. However, it's still not in production and I will implement system cache and other optimizations.
Check this real example, still in development, on a SHARED host, using a MVC:
http://www.gt8.com.br/salaodocalcado/calcados/meia-pata/
In this example, if a user is logged in, he can see different images. All product images and other binaries are in the DB, not cached, not in the FS.
I have run some tests on a dedicated server and the results were far beyond expectations.
So, in my personal opinion, although it takes a major effort to achieve, storing images in the DB is worth it, and the benefits far outweigh the cons.

As everybody else told you, never store images in a database.
A filesystem is used to store files -> images are files -> store them in filesystem :-)

I just tested serving my images as BLOBs. This solution is slower than serving the images as files on the server. Loading time should be the same whether the image comes from the DB or over plain HTTP, but it isn't. Why? I'm fairly sure that when images are files on the server, the browser can cache them and load them only once, the first time. When the image comes from the DB, it is loaded again every time. That's my opinion. Maybe I'm wrong about the browser caching, but the BLOB approach is slower. Sorry for my English ;P

These are the pros of both solutions.
In BLOBs:
1) Clusters are easier to manage, since you do not have to handle tricky points like file syncs between servers
2) DB backups will be exhaustive as well
In files:
1) Native caching comes for free (this is the point missing from the previous comments): refresh handling and headers that you would otherwise have to redesign for the DB (databases do not track last-modification time by default)
2) Ease of resizing later on
3) Ease of moderation (just go through your folders to check if everything is correct)
For all these reasons, and since the two pros of databases are easier to replicate on a file system, I strongly recommend files!

In my case, I store files in the file system. In my images folder I create a new folder for each item, named after the item id (row from the DB), and name the images in order starting from 0. So if I have a table named Items like this:
Items
|-----|------|-----|
| ID | Name | Date|
|-----|------|-----|
| 29 | Test1| 2014|
|-----|------|-----|
| 30 | Test2| 2015|
|-----|------|-----|
| 31 | Test3| 2016|
|-----|------|-----|
my images directory looks something like:
images/
29/
30/
31/
images/29/0.png
images/29/1.jpeg
images/29/2.gif
etc.
