I have a site that allows people to upload files to their account, and they're displayed in a list. Files for all users are stored on different servers, and they move around based on how popular they are (it's a file-hosting site).
I want to add the ability for users to group files into folders. I could go the conventional route and create physical folders on the hard drive for each user on the server, and traverse them as expected. The downside is that each user's files would be bound to a single server. If that server starts running out of space (or many of its files get popular at the same time), it will be very tricky to mitigate.
What I thought about doing is keeping the stateless nature of the files, allowing them to be stored on any of the file servers, and simply storing a folder ID (in addition to the ID of the user who owns the file) with each file in the database. So when a user moves a file, it doesn't physically move anywhere; you just change the folder ID in the database.
Is that a good idea? I use PHP and MySQL.
Yes, it is.
I don't see any downside, except maybe more queries to the database, but with proper indexing of the parent folder id, this will probably be faster than accessing the filesystem directly.
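A minimal sketch of what such a schema might look like (table and column names here are illustrative, not from the question):

    CREATE TABLE folders (
        id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        user_id INT UNSIGNED NOT NULL,
        name    VARCHAR(255) NOT NULL
    );

    CREATE TABLE files (
        id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        user_id   INT UNSIGNED NOT NULL,
        folder_id INT UNSIGNED NULL,       -- NULL = not in any folder
        server    VARCHAR(64)  NOT NULL,   -- which file server holds the bytes
        name      VARCHAR(255) NOT NULL,
        INDEX idx_user_folder (user_id, folder_id)
    );

    -- "Moving" a file is a single UPDATE; no bytes are copied anywhere:
    UPDATE files SET folder_id = 42 WHERE id = 1001 AND user_id = 7;

Listing a folder is then one indexed query against files, regardless of which physical server each file lives on.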
Forget about folders and let the users tag their files, multiple tags per file. Then let them view files tagged X. This isn't much different to implement than virtual folders but is much more flexible for the users.
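A sketch of the usual many-to-many layout for tags, reusing the illustrative files table from above:

    CREATE TABLE tags (
        id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(64) NOT NULL UNIQUE
    );

    CREATE TABLE file_tags (
        file_id INT UNSIGNED NOT NULL,
        tag_id  INT UNSIGNED NOT NULL,
        PRIMARY KEY (file_id, tag_id)      -- one row per file/tag pair
    );

    -- All files a given user has tagged 'invoices':
    SELECT f.*
    FROM files f
    JOIN file_tags ft ON ft.file_id = f.id
    JOIN tags t       ON t.id = ft.tag_id
    WHERE f.user_id = 7 AND t.name = 'invoices';

Since a file can carry any number of tags, this subsumes folders: a folder is just a file with exactly one tag.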
I have seen many questions concerning the storage of user-uploaded image files in a web application, but most of these deal with the following:
Indexing of the images, so as to retrieve them later
How to store them (on the server itself as a file or in the database)
I have a question in regards to this subject, but the question is:
In what directory do I put the uploaded image file? (or other file type, for that matter)
I have a small group I am running PHP apps for. There are very few files that get uploaded, but nonetheless, they get uploaded.
I currently have them in my public HTML document root under /var/www/images/*; however, I am told that it is not smart to store user-uploaded content straight into the /var/www/* directory and that it should be stored elsewhere.
However I cannot find a straightforward statement of where "elsewhere" is.
Keep in mind I do not have a server farm where I can establish certain servers for specific purposes (such as uploaded user files).
Therefore, on a single webserver that hosts usual scripting files, etc. what is the best storage practice for such content?
Thank you.
I don't think there's necessarily a 'best practice' per se; anywhere on your server will be fine, so long as you're able to retrieve the images later on. Typically they'd go inside a folder under /var/www/images/.
Personally I'd recommend creating an individual folder to store these user-uploaded images in (as something like /var/www/images/user_uploads), so that they don't get confused with other images you might have uploaded directly to /var/www/images/ (such as backgrounds or core imagery).
I have a framework that I've written. I created a package for said framework that allows me to track my employees' hours and gives them a place to dump files for the accountant (invoices, etc.).
The problem is that these file dumps are accessible through the browser. I could use a .htaccess file to prevent the files from being served up at all, but the problem is that I would like the accountant and the employee to be able to download their files.
A solution might be to read the files with PHP, create a temporary copy, have the user or accountant download that copy, then delete the copy...but this poses two problems.
1) It's going to be time- and resource-intensive, especially for large files.
2) For however short a time, there will be a copy of the file accessible to anyone who knows the URL.
The other solution would be to put the files outside of the public folder, but the problem is that I would like this package to be portable, and a lot of my servers are shared.
What method could I use to be able to serve the files only when authenticated, and avoid the flaws I described above?
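For what it's worth, here is a minimal sketch of the usual approach: keep direct access to the dump directory blocked (your .htaccess idea), and stream files through a PHP script after an authentication check. readfile() sends the file straight to the client, so no temporary copy is ever created. is_authorized() and the directory name are hypothetical placeholders for whatever your framework provides:

    <?php
    // download.php?file=invoice-2024.pdf
    session_start();

    $storageDir = __DIR__ . '/protected_uploads'; // direct access denied via .htaccess
    $name = basename($_GET['file'] ?? '');        // basename() strips ../ traversal

    // is_authorized() is a placeholder for your framework's auth check
    if ($name === '' || !is_authorized($_SESSION, $name)) {
        http_response_code(403);
        exit('Forbidden');
    }

    $path = $storageDir . '/' . $name;
    if (!is_file($path)) {
        http_response_code(404);
        exit('Not found');
    }

    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . $name . '"');
    header('Content-Length: ' . filesize($path));
    readfile($path); // streams in chunks; no temp copy, modest memory use

This works on shared hosts too, since it only needs .htaccess and plain PHP.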
I have made a database with tables like projects and employees.
I have some photos and PDF files (for instance, scans of certificates) related to the entries in my DB that should be accessible via our internal web site.
Does anyone have any suggestions on how I should manage this?
I was thinking of setting up a subdomain "files.ourdomain.com" and creating subdirectories there for each table. And make a directory for each record? Should I create a DB field like "employee.certificates" with the entire path/filename of the certificate photo?
Or should I actually store the files in the database? (MySQL InnoDB)
In a comment, Peter said that it's easier and better to store the files in a database for security reasons, and there is a case for this: only DBAs and the webservers have access to the files, and they could be encrypted. I don't agree with the "easier" bit, though, as you would need to encode the image before placing it into the DB and decode it again when you want to display it.
You have a choice:
you store the files in the database itself
you store each file in a folder (outside of the web tree) and only store the filename in the database
Each method has pros and cons:
If you store the files within the database itself, you can take the backup file, put it onto another server, and everything will be present. However, if you're using replication or clustering, each server will have a copy of every file, so storage requirements increase. Obviously, you will also need more space for backups, and backing up and restoring will take proportionally longer.
If you store the files in a central location and only record their locations, your DB storage requirements are lower, and multiple DB servers can be confident that there is only ever one copy of a file. Again, the files can be backed up separately. The downside is "what happens if your file storage server fails?" However, with mirroring and backups, this can be mitigated.
In both cases, you would need to store a filename for each file so that the webservers can use them.
Have a look at this Stack Overflow question which does a much better job of the pros and cons than I ever could.
Speaking personally, I store a filename which links to an external file.
I accept file uploads from users. Each file has a pointer in the db which has info on the file location in the filesystem.
Currently, I'm storing the files in the filesystem without any categorisation, and each file is just named with a unique value. All categorisation, naming, etc. is done in the app using the DB.
One factor I'm concerned about is file synchronization.
If I wanted to set up file system synchronization where, for example, the user's files are automatically updated by bridging with a pc app, would this system still work well?
I have no idea how such a system would work so hopefully I can get some input.
Basically, is representing a file's name and location purely in the database optimal, especially if said file may be synchronized with a pc application?
Yes, the way you are doing this is the best way to do it. You are using a file system to store files and a database to store structured data.
One suggestion I would make is that you create a directory tree on the file system. You may one day run up against a maximum-files-per-directory limit of your file system. I have built systems that create a new subdirectory for each day or week (sketched below).
Make sure you have good backups of the database as well as the document repository.
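A sketch of the per-day variant in PHP (paths and names are illustrative):

    <?php
    // Store each upload under $baseDir/YYYY/mm/dd/, creating the path on demand.
    function storagePathForToday(string $baseDir): string
    {
        $dir = $baseDir . '/' . date('Y/m/d');
        if (!is_dir($dir)) {
            mkdir($dir, 0750, true); // recursive mkdir builds the whole tree
        }
        return $dir;
    }

    $dir  = storagePathForToday('/var/file_store');
    $dest = $dir . '/' . uniqid('', true); // unique value as filename, as in the question
    move_uploaded_file($_FILES['upload']['tmp_name'], $dest);
    // store $dest (or the path relative to /var/file_store) in the DB pointer row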
All you need to make such a system work is to make sure the API you use (or, more likely, create) can talk to the database and to the filesystem in a sensible way. Since this is what your site is already doing anyway, it shouldn't be hard to implement.
The mere fact that your files are given identifiers instead of plain-English names is mostly irrelevant with regard to remote synchronization.
Store a file hash in the database rather than a path (e.g. SHA-1) and have a separate database connect the hash with the path. Write a small app that synchronizes the hash database, so that when you move your files to a different location it's easy to build a new database with updated paths.
That way you can also have the system load the file from a different location depending on which hash database you use to locate the file, which offers some transparency if you need people to be able to access the same file from diverse locations (e.g. NFS or WebDAV).
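A sketch of that idea in PHP, assuming a PDO connection and a hypothetical file_locations (hash, path) table:

    <?php
    // Content-addressed storage: the SHA-1 hash identifies the file;
    // a lookup table maps hash -> current physical path.
    function storeFile(PDO $db, string $srcPath, string $storageRoot): string
    {
        $hash = sha1_file($srcPath);
        $dest = $storageRoot . '/' . $hash;
        if (!file_exists($dest)) {   // identical content is only stored once
            rename($srcPath, $dest);
        }
        $stmt = $db->prepare('REPLACE INTO file_locations (hash, path) VALUES (?, ?)');
        $stmt->execute([$hash, $dest]);
        return $hash;                // keep this hash with the file's DB record
    }

Moving the files later only means rewriting the paths in file_locations; every hash stored elsewhere in the DB stays valid.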
We use exactly this model for file storage, along with (shameless plug) SabreDAV to make it seem to the end-user it's a normal filesystem.
I think this is a perfectly fine model; as long as the file lookup is documented and files are easily retrieved, there shouldn't be an issue. Just make backups of your DB :)
One other piece of advice: we use an md5() on the file id to generate a unique filename, and we use parts of the hash to generate a directory structure. For example, id 1 will yield b026324c6904b2a9cb4b88d6d61c81d1, and the resulting filename will become:
b02/632/4c6/904b2a9cb4b88d6d61c81d1
The reason for this is that most filesystems become very slow once a single directory holds a large number of files (or directories). It's much, much faster to traverse a few subdirectories.
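In PHP, that scheme might look like this (note the exact hash depends on how the id is serialized before hashing):

    <?php
    // Turn a numeric file id into a sharded path such as
    // b02/632/4c6/904b2a9cb4b88d6d61c81d1 (3 + 3 + 3 chars, then the rest).
    function shardedPath(int $fileId): string
    {
        $hash = md5((string) $fileId); // 32 hex characters
        return substr($hash, 0, 3) . '/'
             . substr($hash, 3, 3) . '/'
             . substr($hash, 6, 3) . '/'
             . substr($hash, 9);       // remaining 23 chars become the filename
    }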
The Boring Answer™:
I think it depends on what you wanna do, as always :)
I mean take your regular web hosting company. Developers are synching files to web servers all the time. Would it make sense for a web server to store hash-generated file names in a db that pointed to physical files? No. Then you couldn't log in with your FTP-client and upload files like that, and you'd have to code a custom module to get Apache to work etc. Instant headache.
Does it make sense for Flickr to use a db? Yes, absolutely! (Then again, you can't log in with an FTP-client and manage your photos—and that's probably a good thing!)
Just remember, a file system is a (very simple) db too. And it's a db that comes with a lot of useful free tools.
my 2¢
/0
There are some very good questions here on SO about file management and storing within a large project.
Storing Images in DB - Yea or Nay?
Would you store binary data in database or in file system?
The first one has some great insights, and in my project I've decided to go the file route and not the DB route.
A major point against using the filesystem is backup. But in our system we have a great backup scheme, so I am not worried about that.
The next question is how to store the actual files. I've thought about keeping the files' locations static at all times and creating a virtual directory system on the database side of things, so links to the files don't change.
The system I am building will have one global file management area, so all files are accessible to all users. But many who have gone the file route talk about physical directory size (if all the files are within one directory, for example).
So my question is: what are some tips or best-practice methods for creating folders for these static files, or should I not go the virtual directory route at all?
(the project is on the LAMP stack (PHP) if that helps at all)
One way is to assign a unique number to each file and use it to look up the actual file location. Then you can use that number to distribute files into different directories in the filesystem. For example, you could use something like this scheme:
/images/{0}/{1}/{2}
{0}: file_number % 100
{1}: (file_number / 100) % 100
{2}: file_number
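In PHP, that scheme could be implemented like this:

    <?php
    // Distribute files across /images/xx/yy/ buckets based on their number.
    function imagePath(int $fileNumber): string
    {
        $level1 = $fileNumber % 100;              // {0}
        $level2 = intdiv($fileNumber, 100) % 100; // {1}
        return sprintf('/images/%02d/%02d/%d', $level1, $level2, $fileNumber);
    }

    echo imagePath(123456); // /images/56/34/123456

The two directory levels each hold at most 100 entries, and only every 10,000th file lands in the same leaf directory.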
I ran into this problem some time ago for a website that was hosting a lot of files. What we did was take a GUID (which is also the primary key field of a file), e.g. BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301, and store a file like this: /B/C/C/BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301/filename.ext
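A sketch of that layout in PHP (the GUID is assumed to come from the file's primary key):

    <?php
    // Builds /B/C/C/BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301/filename.ext
    function guidPath(string $guid, string $filename): string
    {
        return sprintf('/%s/%s/%s/%s/%s',
            $guid[0], $guid[1], $guid[2], $guid, $filename);
    }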
This has certain advantages:
You can scale out the file servers over multiple servers (and assign specific directories to each one)
You don't have to rename the file
Your directories are guaranteed to be unique
Hope this helps!
In order to avoid creating an excessive number of entries in a single directory, you may want to base your directory structure on pieces of the filename. So, for instance, if you have a file named d7f5ae9b7c5a.png, you may want to store it as media/d7/f5/d7f5ae9b7c5a.png. If your filenames are all hexadecimal, this restricts the number of entries in a single directory to 256 at every level except the final one.
I can't say much about how Apache and PHP manage files, but I can say something about the ext3 file system. ext3 does not seem to have problems with large numbers of files in the same directory; I've tested it with up to a million files. Make sure the dir_index option is enabled on the file system before creating the directories. You can check by running dumpe2fs and change this option by running tune2fs. Hashing the files into a tree of subdirectories can still be useful, because command-line tools can still have problems listing the contents of a huge directory.
One user image is roughly 100 KB. Say there are 10 million users in the database, each with on average 5 images: that's about a 5-terabyte DB. Every image is then served via the DB, and that extra traffic reduces the overall DB server's performance. ... You could use a DB cluster to avoid this, but suppose that is too expensive.
A user reports an error on the live database (on test, everything works correctly). How would you create a dump and unpack it on a developer's machine? How long would that take?
At some point you may decide to move the images to a CDN: what changes would that require in your source code?
I usually take this approach:
Have a global settings variable for your application that points to the folder where you store uploaded files. In your database store the relative paths to the files (relative to what the settings variable points to).
So if a file is located at /www/uploads/image.jpg, your settings variable points to /www/uploads and your database row holds image.jpg. This is a flexible approach that decouples your system's directory structure from your application.
Further, you can fragment file storage into directories based on which database tables the files relate to. Say you have a table user_reports and a table user_photos: you store the files that relate to user_reports in /www/uploads/user_reports. If you have a large number of user uploads, you can take the fragmentation even further. Say a user uploads a file called report.pdf on 20.03.2009; you then store it at /www/uploads/user_reports/2009/03/20/report.pdf.
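A sketch of that layout in PHP (UPLOAD_DIR and the table name are illustrative):

    <?php
    // Global setting: where uploaded files live on this machine.
    define('UPLOAD_DIR', '/www/uploads');

    // Relative path stored in the DB: table name + upload date + filename.
    function relativeUploadPath(string $table, string $filename): string
    {
        return $table . '/' . date('Y/m/d') . '/' . $filename;
    }

    $relative = relativeUploadPath('user_reports', 'report.pdf');
    // e.g. user_reports/2009/03/20/report.pdf -> goes into the database

    $absolute = UPLOAD_DIR . '/' . $relative;
    // e.g. /www/uploads/user_reports/2009/03/20/report.pdf -> used on disk

If the storage folder ever moves, only UPLOAD_DIR changes; every path in the database stays valid.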