I accept file uploads from users. Each file has a record in the db that stores the file's location in the filesystem.
Currently I'm storing the files in the filesystem without any categorisation, and each file is simply named with a unique value. All categorisation, naming, etc. is done in the app using the db.
One factor I'm concerned about is file synchronization.
If I wanted to set up file system synchronization where, for example, the user's files are automatically updated by bridging with a pc app, would this system still work well?
I have no idea how such a system would work so hopefully I can get some input.
Basically, is representing a file's name and location purely in the database optimal, especially if said file may be synchronized with a pc application?
Yes, the way you are doing this is the best way to do it. You are using a file system to store files and a database to store structured data.
One suggestion I would make is to create a directory tree on the file system. You may one day run up against a maximum-files-per-directory limit of your file system. I have built systems that create a new subdirectory for each day or week.
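A minimal sketch of the per-day idea (the storage root, form field name, and table layout here are assumptions for illustration, not anything from the original setup):

    <?php
    // Store each upload under a per-day subdirectory so no single directory
    // accumulates too many files. '/var/uploads' and the 'upload' field are hypothetical.
    $baseDir = '/var/uploads';
    $subDir  = date('Y/m/d');                    // e.g. 2024/05/17
    $target  = $baseDir . '/' . $subDir;

    if (!is_dir($target)) {
        mkdir($target, 0750, true);              // create the day's directory on demand
    }

    $storedName = bin2hex(random_bytes(16));     // unique, meaning-free filename
    move_uploaded_file($_FILES['upload']['tmp_name'], "$target/$storedName");

    // The DB row then keeps the relative path, e.g.
    // INSERT INTO files (user_id, original_name, path) VALUES (?, ?, '2024/05/17/<name>')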
Make sure you have good backups of the database as well as the document repository.
All you need to make such a system work is to make sure the API you use (or, more likely, create) can talk to the database and to the filesystem in a sensible way. Since this is what your site is already doing anyway, it shouldn't be hard to implement.
The mere fact that your files are given identifiers instead of plain-English names is mostly irrelevant with regard to remote synchronization.
Store a file hash in the database rather than a path (e.g. SHA-1) and have a separate database that connects the hash with the path. Write a small app that synchronizes the hash database, so that when you move your files to a different location it's easy to build a new database with updated paths.
That way you can also have the system load the file from a different location depending on which hash database you use to locate the file, which offers some transparency if you need people to be able to access the same file from diverse locations (e.g. NFS or WebDAV).
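A rough sketch of that layout, assuming PHP/MySQL; the hash_locations table and its column names are made up for illustration:

    <?php
    // The application DB stores only the SHA-1 of the content; a separate
    // hash_locations table (the "hash database") maps hash -> current path.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    function storeFile(PDO $pdo, string $tmpPath, string $storageRoot): string {
        $hash = sha1_file($tmpPath);                 // content-addressed identifier
        $dest = "$storageRoot/$hash";
        if (!file_exists($dest)) {
            rename($tmpPath, $dest);                 // identical content is stored once
        }
        $stmt = $pdo->prepare('REPLACE INTO hash_locations (hash, path) VALUES (?, ?)');
        $stmt->execute([$hash, $dest]);
        return $hash;                                // the application DB keeps only this
    }

    function locateFile(PDO $pdo, string $hash): ?string {
        $stmt = $pdo->prepare('SELECT path FROM hash_locations WHERE hash = ?');
        $stmt->execute([$hash]);
        $path = $stmt->fetchColumn();
        return $path === false ? null : $path;
    }

Rebuilding the hash database after moving the files is then just a matter of re-running storeFile-style inserts (or a directory scan) against the new location.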
We use exactly this model for file storage, along with (shameless plug) SabreDAV to make it seem to the end-user it's a normal filesystem.
I think this is a perfectly fine model; as long as the file lookup is documented and files are easily retrieved, there shouldn't be an issue. Just make backups of your DB :)
One other piece of advice: we use an md5() on the file id to generate a unique filename, and we use parts of that hash to generate a directory structure. For example, id 1 yields b026324c6904b2a9cb4b88d6d61c81d1, and the resulting filename becomes:
b02/632/4c6/904b2a9cb4b88d6d61c81d1. The reason for this is that most filesystems become very slow once there is a high number of files (or directories) in one directory; it's much, much faster to traverse a few subdirectories.
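A sketch of the splitting step in PHP (the hash below is the one from the example above; exactly how you feed the id into md5() is up to you):

    <?php
    // Split a 32-character md5 hash into three short directory levels plus the remainder.
    function storagePath(string $hash): string {
        return substr($hash, 0, 3) . '/'
             . substr($hash, 3, 3) . '/'
             . substr($hash, 6, 3) . '/'
             . substr($hash, 9);
    }

    echo storagePath('b026324c6904b2a9cb4b88d6d61c81d1');
    // b02/632/4c6/904b2a9cb4b88d6d61c81d1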
The Boring Answer™:
I think it depends on what you wanna do, as always :)
I mean, take your regular web hosting company. Developers are syncing files to web servers all the time. Would it make sense for a web server to store hash-generated file names in a db that points to physical files? No. Then you couldn't log in with your FTP client and upload files like that, and you'd have to code a custom module to get Apache to work, etc. Instant headache.
Does it make sense for Flickr to use a db? Yes, absolutely! (Then again, you can't log in with an FTP-client and manage your photos—and that's probably a good thing!)
Just remember, a file system is a (very simple) db too. And it's a db that comes with a lot of useful free tools.
my 2¢
/0
Related
I have made a database with tables like projects and employees.
I have some photos and PDF files (for instance, scans of certificates) related to the entries in my DB, which should be accessible via our internal web site.
Does anyone have any suggestions on how I should manage this?
I was thinking of setting up a subdomain "files.ourdomain.com" and creating subdirectories there for each table. And make a directory for each record? Should I create a DB field like "employee.certificates" with the entire path/filename of the certificate photo?
Or should I actually store the files in the database? (MySQL InnoDB)
In a comment, Peter said that it's easier and better to store the files in a database for security reasons, and there is a case for this - only DBAs and the web servers will have access to the files, and they could be encrypted. I don't agree with the "easier" bit, though, as you would need to encode the image before placing it into the DB and decode it again when you want to display it.
You have a choice:
you store the files actually in the database itself
you store the file in a folder (outside of the web tree) and only store the filename in the database
Each method has pros and cons:
If you store the files actually within the database, you can take the backup file, put it onto another server, and everything will be present. However, if you're using replication or clustering then each server will have a copy of the file, so storage requirements increase. Obviously, you will also need more space for backups, and backing up and restoring will take proportionally longer.
If you store the file in a central location and only record the location, your DB storage requirements are lower and multiple DB servers can be confident that there is only ever one copy of a file. Again, the files can be backed up separately. The downside is "what happens if your file storage server fails". However, with mirroring and backups, this can be mitigated against.
In both cases, you would need to store a filename for each file so that the webservers can use them.
Have a look at this Stack Overflow question which does a much better job of the pros and cons than I ever could.
Speaking personally, I store a filename which links to an external file.
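A minimal sketch of that option in PHP (the storage root, table, and column names here are illustrative, not from the question):

    <?php
    // Files live outside the web tree; the DB row holds only the stored filename,
    // and this script streams the bytes to the browser.
    $storageRoot = '/srv/filestore';                 // not web-accessible

    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT stored_name, mime_type FROM documents WHERE id = ?');
    $stmt->execute([$_GET['id']]);
    $row  = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        http_response_code(404);
        exit;
    }

    header('Content-Type: ' . $row['mime_type']);
    readfile($storageRoot . '/' . basename($row['stored_name'])); // basename() guards against path tricks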
Is it better to read and list images directly from the file system using simple PHP, or is it better to store image meta info and the filename in the database and access the images by doing a MySQL select? What are the pros and cons of both solutions?
Listing files on a file system is probably the easiest way to accomplish what you're trying to do, but it's going to be very slow if you are trying to cycle through several thousand directories/files on a networked file system (NFS, CIFS, GlusterFS, etc.).
Storing files in a database creates much more overhead, since you are now involving an external application to store information. You have to remember that every time you use a database you are also paying for network I/O, the authentication mechanism, the query parser, etc. At the same time, all of this overhead might still give a faster response than using a networked file system.
To conclude - everything depends on the number of files you are working with and the underlying infrastructure. The two major things to look out for are disk I/O and network I/O.
I would do the following:
Upload all the images in one directory
Store references to those images that are tied to the uploader's User ID
Then just select the image URLs that are tied to that ID, and output them however necessary (see the sketch after this list).
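A minimal sketch of that last step (the table and column names are assumptions):

    <?php
    // Fetch the image references stored for one uploader and print them as <img> tags.
    $currentUserId = 42;                             // hypothetical logged-in user
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT filename FROM images WHERE user_id = ?');
    $stmt->execute([$currentUserId]);

    foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $filename) {
        // All images live in the single upload directory described above.
        $url = '/uploads/' . rawurlencode($filename);
        echo '<img src="' . htmlspecialchars($url) . '" alt="">' . PHP_EOL;
    }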
People find it easier to store their files within folders and parse that folder with PHP. If you go the database route, the database just keeps getting larger and larger.
I can see it becoming personal preference, but I personally have gone with parsing folders for images rather than storing them in a database.
Depends on the scale of what you are doing.
This is what I would be doing.
Store the file metadata in the database. You can store quite a bit of information about each image this way.
Store the image file on a distributed storage system like Amazon S3 and keep the path/key in your metadata. Replication is part of the system, and it integrates easily with the CloudFront CDN.
Distribute the images through the Amazon CloudFront CDN (a rough sketch of this setup follows the list).
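A rough sketch of that setup, assuming the AWS SDK for PHP (v3); the bucket name, table, and columns are made up for illustration:

    <?php
    require 'vendor/autoload.php';

    use Aws\S3\S3Client;

    // Credentials are picked up from the environment / instance profile.
    $s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

    $userId = 42;                                              // hypothetical uploader
    $key    = 'images/' . bin2hex(random_bytes(16)) . '.jpg';  // object key stored as the "path"

    $s3->putObject([
        'Bucket'     => 'my-image-bucket',
        'Key'        => $key,
        'SourceFile' => $_FILES['image']['tmp_name'],
    ]);

    // Only metadata plus the S3 key go into MySQL; CloudFront serves the bytes.
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO image_meta (user_id, s3_key, original_name, uploaded_at)
                           VALUES (?, ?, ?, NOW())');
    $stmt->execute([$userId, $key, $_FILES['image']['name']]);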
I have a site that allows people to upload files to their account, and they're displayed in a list. Files for all users are stored on different servers, and they move around based on how popular they are (it's a file-hosting site).
I want to add the ability for users to group files into folders. I could go the conventional route and create physical folders on the hard drive for each user on the server, and traverse them as expected. The downside is that the user's files would be bound to a single server. If that server starts running out of space (or many files get popular at the same time), it will get very tricky to mitigate.
What I thought about doing is keeping the stateless nature of files, allowing them to be stored on any of the file servers, and simply storing the folder ID (in addition to the user id who owns the file) with each file in the database. So when a user decides to move a file, it doesn't get physically moved anywhere, you just change the folder ID in the database.
Is that a good idea? I use php and mysql.
Yes, it is.
I don't see any downside, except maybe more queries to the database, but with proper indexing of the parent folder id, this will probably be faster than accessing the filesystem directly.
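A sketch of the "move" operation under that model (table and column names are assumptions): nothing on disk changes, only the folder id on the file's row.

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $userId = 42; $fileId = 1001; $targetFolderId = 7;   // hypothetical values

    $stmt = $pdo->prepare('UPDATE files SET folder_id = ? WHERE id = ? AND user_id = ?');
    $stmt->execute([$targetFolderId, $fileId, $userId]);

    // Listing a folder is then just a lookup on the indexed folder_id column:
    // SELECT id, name FROM files WHERE user_id = ? AND folder_id = ?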
Forget about folders and let the users tag their files, multiple tags per file. Then let them view files tagged X. This isn't much different to implement than virtual folders but is much more flexible for the users.
There are some very good questions here on SO about file management and storing within a large project.
Storing Images in DB - Yea or Nay?
Would you store binary data in database or in file system?
The first one has some great insights, and in my project I've decided to go the file route and not the DB route.
A major point against using the filesystem is backup, but in our system we have a great backup scheme, so I am not worried about that.
The next question is how to store the actual files. I've thought about keeping the files' locations static at all times and creating a virtual directory system on the database side of things, so links to the files don't change.
The system I am building will have one global file store, so all files are accessible to all users. But many who have gone the file route mention physical directory size as a concern (if all the files are within one directory, for example).
So my question is: what are some tips or best-practice methods for creating folders for these static files, or should I not go the virtual directory route at all?
(the project is on the LAMP stack (PHP) if that helps at all)
One way is to assign a unique number to each file and use it to look up the actual file location. Then you can use that number to distribute files across different directories in the filesystem. For example, you could use something like this scheme (a PHP sketch follows the scheme):
/images/{0}/{1}/{2}
{0}: file_number % 100
{1}: (file_number / 100) % 100
{2}: file_number
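A small PHP sketch of that scheme (the /images root is just the placeholder from above):

    <?php
    function imagePath(int $fileNumber): string {
        $level1 = $fileNumber % 100;               // {0}
        $level2 = intdiv($fileNumber, 100) % 100;  // {1}
        return sprintf('/images/%d/%d/%d', $level1, $level2, $fileNumber);
    }

    echo imagePath(12345);   // /images/45/23/12345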
I ran into this problem some time ago for a website that was hosting a lot of files. What we did was take a GUID (which is also the primary key field of a file), e.g. BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301, and store the file like this: /B/C/C/BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301/filename.ext
This has certain advantages:
You can scale out the file servers over multiple servers (and assign specific directories to each one)
You don't have to rename the file
Your directories are guaranteed to be unique
Hope this helps!
In order to avoid creating an excessive number of entries in a single directory, you may want to base creating directories on pieces of the filename. So for instance, if you have a file named d7f5ae9b7c5a.png, you may want to store it in media/d7/f5/d7f5ae9b7c5a.png. If your filenames are all hexadecimal then this will restrict the number of entries in a single directory to 256 up until the final level.
I can't say much about how Apache and PHP manage files, but I can say something about the ext3 file system. ext3 does not seem to have problems with large numbers of files in the same directory; I've tested it with up to a million files. Make sure the dir_index option is enabled on the file system before creating the directories. You can check by running dumpe2fs and change this option by running tune2fs. Hashing the files into a tree of subdirectories can still be useful, because command-line tools can still have problems listing the contents of a huge directory.
One user image is ~100 KB, so with 10,000 users in the database and an average of 5 images per user, that is already ~5 GB of image data inside the DB (and it only grows from there). Every image is then served through the DB, and this extra DB traffic reduces overall DB server performance. ... You could use a DB cluster to avoid this, but suppose that is too expensive.
A user reports an error on the live database (on the test database everything works correctly) - how would you create a dump and unpack it on a developer's machine? How long would that take?
At some point you may decide to put the images on a CDN - what changes would that require in your source code?
I usually take this approach:
Have a global settings variable for your application that points to the folder where you store uploaded files. In your database store the relative paths to the files (relative to what the settings variable points to).
So if a file is located at /www/uploads/image.jpg, your settings variable points to /www/uploads and your database row has image.jpg. This is a flexible way to decouple your system's directory structure from your application.
Further, you can fragment file storage into directories based on which database tables the files relate to. Say you have a table user_reports and a table user_photos; you store the files that relate to user_reports in /www/uploads/user_reports. If you have a large number of user uploads you can take the fragmentation even further: say a user uploads a file called report.pdf on 20.03.2009, then you store it at /www/uploads/user_reports/2009/03/20/report.pdf.
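A short sketch of that approach (the setting name, directory root, and table-based fragmentation are the ones described above; everything else is illustrative):

    <?php
    $config = ['upload_root' => '/www/uploads'];    // the single global setting

    function saveUserReport(array $config, string $tmpFile, string $originalName): string {
        // Fragment by table name and upload date, as described above.
        $relative = 'user_reports/' . date('Y/m/d') . '/' . basename($originalName);
        $absolute = $config['upload_root'] . '/' . $relative;

        if (!is_dir(dirname($absolute))) {
            mkdir(dirname($absolute), 0750, true);
        }
        move_uploaded_file($tmpFile, $absolute);

        return $relative;   // only this relative path goes into the database row
    }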
A while ago I developed a music site that allowed audio files to be uploaded and then converted into various formats using ffmpeg. People would then download the uploaded audio files after purchasing them; a tmp file would be created at the download location, valid only for that download instance, and the tmp file would then get deleted.
Now I am revisiting the project, I have to add pictures and video as upload content also.
I want to find the best method for storing the files:
option 1: storing the files in a folder and referencing them in the database
option 2: storing the actual file in the database (MySQL) as a BLOB
I am toying around with this in order to consider the security implications of each method, and other issues I might not have accounted for.
See this earlier StackOverflow question Storing images in a database, Yea or nay?.
I know you mentioned images and video, however this question has relevance to all large binary content media files.
The consensus seems to be that storing file paths to the images on the filesystem, rather than the actual images, is the way to go.
I would recommend storing as files and storing their locations in the database.
Storing the files in a database requires more resources and makes backing up/restoring the database slower.
Do you really want to have to transfer lots of videos every time you do a database dump?
File systems work very well for dishing out files, and you can back them up/sync them very easily.
I would go for the database option. I've used it on a number of projects, some very large (100+ GB). The storage implementation is key; design it poorly and your performance will be punished. See this example for some good implementation ideas:
Database storage allows more scalability and security.
I would go for storing files directly on the disk, and database holding only their ID/url.
This way, accessing those files (which can be large binary files) doesn't require any PHP/database operation; it's done by the webserver directly.
Also, it will be easier to move those files to another server if you want to.
Actually, the only upside I can see at the moment to storing them in the database is easier backup - you want to back up your DB anyway, so this way you have all the data in one place and you can be sure each backup is complete (i.e. you don't have files on disk that aren't used by database entries, and you don't have image IDs in your database that point to nowhere).
I asked a similar question using Oracle as the backend for a Windows Forms application.
The answer really boils down to your requirements for backing up and restoring the files. If that requirement is important, then use the database, as it'll be easier (you're backing up the database anyway, right? :o)