Tips for managing a large number of files?


There are some very good questions here on SO about file management and storing within a large project.
Storing Images in DB - Yea or Nay?
Would you store binary data in database or in file system?
The first one has some great insights, and in my project I've decided to go the file route and not the DB route.
A major point against using the filesystem is backup. But in our system we have a great backup scheme, so I am not worried about that.
The next question is how to store the actual files. I've thought about keeping each file's location static at all times and creating a virtual directory system on the database side, so links to a file never change.
The system I am building will have one global file manager, so all files are accessible to all users. But many who have gone the file route talk about physical directory size (if all the files are within one directory, for example).
So my question is: what are some tips or best practices for creating folders for these static files, or should I not go the virtual directory route at all?
(the project is on the LAMP stack (PHP) if that helps at all)

One way is to assign a unique number to each file and use it to look up the actual file location. You can then use that number to distribute files across different directories in the filesystem. For example, you could use something like this scheme:
/images/{0}/{1}/{2}
{0}: file_number % 100
{1}: (file_number / 100) % 100
{2}: file_number
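A minimal sketch of the scheme above, written in Python for brevity (the same arithmetic ports directly to PHP). The function name `numbered_path` and the default root are illustrative assumptions, and the division is assumed to be integer division:

```python
def numbered_path(file_number: int, root: str = "/images") -> str:
    """Map a file number to root/{n % 100}/{(n // 100) % 100}/{n}."""
    level0 = file_number % 100           # {0}: last two decimal digits
    level1 = (file_number // 100) % 100  # {1}: next two decimal digits
    return f"{root}/{level0}/{level1}/{file_number}"


print(numbered_path(123456))  # /images/56/34/123456
```

With two levels of 100 buckets each, files spread over up to 10,000 directories, so each directory stays small until the total file count gets very large.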

I ran into this problem some time ago for a website that was hosting a lot of files. What we did was take a GUID (which is also the primary key field of a file), e.g. BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301, and store a file like this: /B/C/C/BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301/filename.ext
This has certain advantages:
You can scale out the file servers over multiple servers (and assign specific directories to each one)
You don't have to rename the file
Your directories are guaranteed to be unique
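As a rough illustration of that layout (a Python sketch; `guid_path` is a hypothetical helper, and it assumes the first three characters of the GUID are used as-is for the directory levels):

```python
def guid_path(guid: str, filename: str) -> str:
    """Build /X/Y/Z/<guid>/<filename> from the first three GUID characters."""
    a, b, c = guid[0], guid[1], guid[2]
    # The original filename is kept because the GUID directory is already unique.
    return f"/{a}/{b}/{c}/{guid}/{filename}"


print(guid_path("BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301", "filename.ext"))
```

Because the top-level directories are single hex characters, a range of them can be mounted from (or routed to) a different file server.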
Hope this helps!

In order to avoid creating an excessive number of entries in a single directory, you may want to base creating directories on pieces of the filename. So for instance, if you have a file named d7f5ae9b7c5a.png, you may want to store it in media/d7/f5/d7f5ae9b7c5a.png. If your filenames are all hexadecimal then this will restrict the number of entries in a single directory to 256 up until the final level.
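A small sketch of this idea in Python (the helper name `sharded_path` and its parameters are assumptions for illustration). Two hex characters per level give 16 * 16 = 256 possible directory names at each level:

```python
from pathlib import PurePosixPath


def sharded_path(filename: str, root: str = "media",
                 levels: int = 2, width: int = 2) -> PurePosixPath:
    """Use the leading characters of the filename as nested directory names."""
    parts = [filename[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


print(sharded_path("d7f5ae9b7c5a.png"))  # media/d7/f5/d7f5ae9b7c5a.png
```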

I can't say much about how Apache and PHP manage files, but I can say something about the ext3 file system. ext3 does not seem to have problems with large numbers of files in the same directory. I've tested it with up to a million files. Make sure the dir_index option is enabled on the file system before creating the directories. You can check by running dumpe2fs and change this option by running tune2fs. Hashing the files into a tree of subdirectories can still be useful, because command line tools can still have problems listing the contents of a very large directory.

One user image is ~100 KB. Say we have 10,000 users in the database and each user has 5 images on average: that is about 5 GB of image data in the DB (100 KB times 50,000 images), and every image request goes through the DB. This extra traffic will reduce overall DB server performance. You could use a DB cluster to avoid this, but that is expensive.
A user reports an error on the live database (on the test system everything works correctly). How would you create a dump and unpack it on a developer's machine? How much time would it take?
At some point you may decide to put the images on a CDN. What changes would that require in your source code?

I usually take this approach:
Have a global settings variable for your application that points to the folder where you store uploaded files. In your database store the relative paths to the files (relative to what the settings variable points to).
So if a file is located at /www/uploads/image.jpg, your settings variable points to /www/uploads and your database row has image.jpg. This is a flexible way to decouple your system's directory structure from your application.
Further, you can fragment file storage into directories based on which database tables the files relate to. Say you have a table user_reports and a table user_photos. You store the files that relate to user_reports in /www/uploads/user_reports. If you have a large number of user uploads, you can take the fragmentation even further. Say a user uploads a file on 20.03.2009 and the file is called report.pdf; you store it at /www/uploads/user_reports/2009/03/20/report.pdf.
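The approach described above might look roughly like this in Python. `UPLOAD_ROOT`, `relative_path` and `absolute_path` are names invented for this sketch; only the relative path would be stored in the database row:

```python
from datetime import date
from pathlib import PurePosixPath

# Stands in for the global settings variable; changing it relocates every file.
UPLOAD_ROOT = PurePosixPath("/www/uploads")


def relative_path(table: str, uploaded_on: date, filename: str) -> str:
    """Value to store in the database: <table>/<YYYY>/<MM>/<DD>/<filename>."""
    return str(PurePosixPath(table, f"{uploaded_on:%Y/%m/%d}", filename))


def absolute_path(relative: str) -> PurePosixPath:
    """Resolve a stored relative path against the configured upload root."""
    return UPLOAD_ROOT / relative


rel = relative_path("user_reports", date(2009, 3, 20), "report.pdf")
print(rel)                 # user_reports/2009/03/20/report.pdf
print(absolute_path(rel))  # /www/uploads/user_reports/2009/03/20/report.pdf
```

Moving the uploads to another disk or server then only requires changing `UPLOAD_ROOT`, not rewriting database rows.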

Related

PHP uploading and downloading files

I'm facing a dilemma on how to implement file upload and download in a PHP website.
I have these criteria:
Performance - does not give performance issues to the website
File size - around 2GB and up.
Authorization - I want to be able to change who can access the files in PHP. Allow multiple users to gain access to a single file.
User friendly - no additional tools to use.
So here are the methods I'm currently looking at and how I assess them based on my criteria:
Database BLOB
Writing the file data into the output stream will take time and blocks other requests (is this correct?)
I read somewhere that there's a size limit for BLOB.
OK - I can easily control who can download the files here.
OK - No additional tools, just the website.
FTP
OK - since it is designed to store files.
OK - file system is the limit.
I need to create another set of credentials for each user, besides the username and password for the website. I assume I have to move the file from one location to another to update authorization, but what if multiple users can access one file? A shared directory? It looks messy.
Users need another tool/program to access their files, and need to remember another username and password.
My questions:
Based on my assumptions, do you think I understand the methods correctly?
If my assumptions are wrong, is there a way I can do this functionality while meeting my criteria?
PS Please excuse my English.

Why not just use the file system to store the files, and additionally store the path to each file (plus permissions, if needed) in a database?
The upload folder isn't publicly accessible, and a wrapper script serves the content to the user.
Performance shouldn't be a problem, as you just move/copy the uploaded file to a dedicated data directory.
File size isn't a problem (as long as you have enough disk space).
The wrapper script handles permissions and serves files to the users.
It's as friendly as you design your UI.
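A sketch of the decision logic such a wrapper script could run before streaming a file, written in Python for illustration. Everything here is hypothetical: `DATA_DIR` is an assumed location outside the web root, and the `FILES` dictionary stands in for the database table that holds paths and permissions:

```python
from pathlib import Path
from typing import Optional

DATA_DIR = Path("/var/app-data/uploads")  # assumed path, not served by the web server

# Stand-in for the database: file id -> (stored name, ids of users allowed to read it)
FILES = {
    42: ("report.pdf", {1, 7}),
}


def resolve_download(file_id: int, user_id: int) -> Optional[Path]:
    """Return the path to serve, or None if the file is unknown or access is denied."""
    record = FILES.get(file_id)
    if record is None:
        return None
    stored_name, allowed_users = record
    if user_id not in allowed_users:
        return None
    path = (DATA_DIR / stored_name).resolve()
    # Refuse anything that would escape the data directory (e.g. "../" in a name).
    if DATA_DIR.resolve() not in path.parents:
        return None
    return path
```

Granting another user access to the same file is then a change to the permission data only; the file itself never moves.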

For me, using a BLOB is not the best option. I thought about using BLOBs for picture uploads on my own website, but the best way to handle uploaded files is to put them directly on your server's file system.

Advice on how to store/manage files for entries in DB

I have made a database with tables like projects, employees.
I have some photos and PDF files (for instance, scans of certificates) related to the entries in my DB that should be accessible via our internal web site.
Does anyone have any suggestions on how I should manage this?
I was thinking of setting up a subdomain "files.ourdomain.com" and creating subdirectories there for each table. And make a directory for each record? Should I create a DB field for "employee.certificates" with the entire path/filename of the certificate photo?
Or should I actually store the files in the database? (MySQL InnoDB)

In a comment, Peter said that it's easier and better to store the files in a database for security reasons, and there is a case for this: only DBAs and the webservers will have access to the files, and they could be encrypted. I don't agree with the "easier" bit, though, as you would need to encode the image before placing it into the DB and decode it again when you want to display it.
You have a choice:
you store the files actually in the database itself
you store the file in a folder (outside of the web tree) and only store the filename of the file
Each method has pros and cons:
If you store the files within the database, you could take the backup file and put it onto another server and everything will be present. However, if you're using replication or clustering, then each server will have a copy of the file and so storage requirements increase. Obviously, you will also need more space for backups, and backing up and restoring will take proportionally longer.
If you store the file in a central location and only record the location, your DB storage requirements are lower and multiple DB servers can be confident that there is only ever one copy of a file. Again, the files can be backed up separately. The downside is "what happens if your file storage server fails". However, with mirroring and backups, this can be mitigated.
In both cases, you would need to store a filename for each file so that the webservers can use them.
Have a look at this Stack Overflow question which does a much better job of the pros and cons than I ever could.
Speaking personally, I store a filename which links to an external file.

Storing files in directories in an online file manager

I have a site that allows people to upload files to their account, and they're displayed in a list. Files for all users are stored on different servers, and they move around based on how popular they are (it's a file hosting site).
I want to add the ability for users to group files into folders. I could go the conventional route and create physical folders on the hard drive for each user on the server, and traverse them as expected. The downside is that the user's files will be bound to a single server. If that server starts running out of space (or many files get popular at the same time), it will get very tricky to mitigate.
What I thought about doing is keeping the stateless nature of files, allowing them to be stored on any of the file servers, and simply storing the folder ID (in addition to the ID of the user who owns the file) with each file in the database. So when a user decides to move a file, it doesn't get physically moved anywhere; you just change the folder ID in the database.
Is that a good idea? I use PHP and MySQL.

Yes, it is.
I don't see any downside, except maybe more queries to the database, but with proper indexing of the parent folder id, this will probably be faster than accessing the filesystem directly.
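To make the virtual-folder idea concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL. The table layout, column names and helper functions are assumptions for illustration, not the asker's actual schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE files (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        folder_id INTEGER,        -- virtual folder; not a directory on disk
        name      TEXT NOT NULL,
        server    TEXT NOT NULL   -- file server currently holding the bytes
    );
    -- Index the parent folder id so listing a folder stays cheap.
    CREATE INDEX idx_files_folder ON files (user_id, folder_id);
""")
db.execute("INSERT INTO files VALUES (1, 10, 100, 'song.mp3', 'fs3')")


def move_file(file_id: int, new_folder_id: int) -> None:
    """Moving a file only rewrites its folder id; the stored bytes stay put."""
    db.execute("UPDATE files SET folder_id = ? WHERE id = ?",
               (new_folder_id, file_id))


def list_folder(user_id: int, folder_id: int) -> list:
    rows = db.execute(
        "SELECT name FROM files WHERE user_id = ? AND folder_id = ? ORDER BY name",
        (user_id, folder_id),
    )
    return [name for (name,) in rows]


move_file(1, 200)
print(list_folder(10, 200))  # ['song.mp3']
```

Note that the `server` column is untouched by the move, which is what keeps files free to migrate between servers independently of the folder structure.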

Forget about folders and let the users tag their files, multiple tags per file. Then let them view files tagged X. This isn't much different to implement than virtual folders but is much more flexible for the users.
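A tagging schema is only slightly different from the virtual-folder one: a join table replaces the single folder id. The following is a sketch with Python's sqlite3 standing in for MySQL; table and function names are illustrative assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE file_tags (
        file_id INTEGER NOT NULL REFERENCES files(id),
        tag     TEXT NOT NULL,
        PRIMARY KEY (file_id, tag)   -- a file carries each tag at most once
    );
    CREATE INDEX idx_file_tags_tag ON file_tags (tag);
""")
db.executemany("INSERT INTO files VALUES (?, ?)",
               [(1, "beach.jpg"), (2, "invoice.pdf")])
db.executemany("INSERT INTO file_tags VALUES (?, ?)",
               [(1, "holiday"), (1, "2009"), (2, "2009")])


def files_tagged(tag: str) -> list:
    """Return the names of all files carrying the given tag."""
    rows = db.execute(
        "SELECT f.name FROM files f JOIN file_tags t ON t.file_id = f.id "
        "WHERE t.tag = ? ORDER BY f.name",
        (tag,),
    )
    return [name for (name,) in rows]


print(files_tagged("2009"))     # ['beach.jpg', 'invoice.pdf']
print(files_tagged("holiday"))  # ['beach.jpg']
```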

Future proof file storage

I accept file uploads from users. Each file has a pointer in the db which has info on the file location in the filesystem.
Currently, I'm storing the files in the filesystem non categorically, and each file is currently just named a unique value. All categorisation and naming etc is done in the app using the db.
A factor that I'm concerned about is that of file synchronization issues.
If I wanted to set up file system synchronization where, for example, the user's files are automatically updated by bridging with a pc app, would this system still work well?
I have no idea how such a system would work so hopefully I can get some input.
Basically, is representing a file's name and location purely in the database optimal, especially if said file may be synchronized with a pc application?

Yes, the way you are doing this is the best way to do it. You are using a file system to store files and a database to store structured data.
One suggestion I would make is that you create a directory tree on the file system. You may one day run up against a maximum files per directory limitation of your file system. I have built systems that create a new sub directory for each day or week.
Make sure you have good backups of the database as well as the document repository.

All you need to make such a system work is to make sure the API you use (or, more likely, create) can talk to the database and to the filesystem in a sensible way. Since this is what your site is already doing anyway, it shouldn't be hard to implement.
The mere fact that your files are given identifiers instead of plain-English names is mostly irrelevant with regard to remote synchronization.

Store a file hash (e.g. SHA1) in the database rather than a path, and have a separate database connect the hash with the path. Write a small app that synchronizes the hash database, so that when you move your files to a different location it'll be easy to build a new database with updated paths.
That way you can also have the system load the file from a different location depending on which hash database you use to locate it, so it offers some transparency if you need people to be able to access the same file from diverse locations (e.g. NFS or WebDAV).
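One way to picture the hash-to-path indirection is the Python sketch below. The two dictionaries are hypothetical stand-ins for the separate "hash databases", one per storage location, and the paths are made up:

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-1 hex digest of the file contents; this is what the main DB stores."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical location indexes: one hash -> path mapping per storage backend.
local_index = {}
nfs_index = {}


def register(index: dict, data: bytes, path: str) -> str:
    """Record where the given content lives in one particular index."""
    digest = content_hash(data)
    index[digest] = path
    return digest


digest = register(local_index, b"hello", "/srv/files/a/hello.txt")
register(nfs_index, b"hello", "/mnt/nfs/files/hello.txt")

# The same digest resolves to a different path depending on the index consulted.
print(local_index[digest])
print(nfs_index[digest])
```

If the files are moved, only the index for that location needs rebuilding (by rehashing the files there); the application's own records keep pointing at the same digests.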

We use exactly this model for file storage, along with (shameless plug) SabreDAV to make it seem to the end-user it's a normal filesystem.
I think this is a perfectly fine model, as long as looking up the file is documented and easily retrieved there shouldn't be an issue. Just make backups of your DB :)
One other piece of advice I can give: we use an md5() on the file ID to generate a unique filename, and we use parts of that hash to generate a directory structure. For example, id 1 will yield b026324c6904b2a9cb4b88d6d61c81d1, and the resulting filename will become:
b02/632/4c6/904b2a9cb4b88d6d61c81d1
The reason for this is that most stable filesystems can become very slow once there is a high number of files (or directories) in one directory. It's much, much faster to traverse a few sub-directories.
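The splitting step can be sketched as follows (Python; `shard_digest` is a hypothetical helper). The digest is passed in as-is rather than recomputed, so the sketch makes no assumption about exactly how the ID was fed to md5():

```python
def shard_digest(digest: str, levels: int = 3, width: int = 3) -> str:
    """Turn the first levels*width characters into directories, keep the rest as the filename."""
    cut = levels * width
    dirs = [digest[i:i + width] for i in range(0, cut, width)]
    return "/".join(dirs + [digest[cut:]])


print(shard_digest("b026324c6904b2a9cb4b88d6d61c81d1"))
# b02/632/4c6/904b2a9cb4b88d6d61c81d1
```

With three hex characters per level, each directory holds at most 16^3 = 4096 subdirectories.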

The Boring Answer™:
I think it depends on what you wanna do, as always :)
I mean take your regular web hosting company. Developers are synching files to web servers all the time. Would it make sense for a web server to store hash-generated file names in a db that pointed to physical files? No. Then you couldn't log in with your FTP-client and upload files like that, and you'd have to code a custom module to get Apache to work etc. Instant headache.
Does it make sense for Flickr to use a db? Yes, absolutely! (Then again, you can't log in with an FTP-client and manage your photos—and that's probably a good thing!)
Just remember, a file system is a (very simple) db too. And it's a db that comes with a lot of useful free tools.
my 2¢
/0

Where to store uploaded files (sound, pictures and video)

A while ago I had to develop a music site that allowed audio files to be uploaded and then converted into various formats using ffmpeg. People would download the uploaded audio files after purchasing them: a tmp file would be created and placed at the download location, was valid only for that download instance, and was then deleted.
Now that I am revisiting the project, I have to add pictures and video as upload content as well.
I want to find the best method for storing the files:
option 1 : storing the files in a folder and reference them in the database
option 2 : storing the actual file in the database(mysql) as blob.
I am toying around with this idea to consider the security implications of each method, and other issues I might have not calculated for.

See this earlier StackOverflow question Storing images in a database, Yea or nay?.
I know you mentioned images and video, however this question has relevance to all large binary content media files.
The consensus seems to be that storing file paths to the images on the filesystem, rather than the actual images, is the way to go.

I would recommend storing as files and storing their locations in the database.
Storing the files in a database requires more resources and makes backing up/restoring databases slower.
Do you really want to have to transfer lots of videos every time you do a database dump?
File systems work very well for dishing out files, and you can back them up/sync them very easily.

I would go for the database option. I've used it on a number of projects, some very large (100+ GB). The storage implementation is key: design it poorly and your performance will be punished. See this example for some good implementation ideas:
Database storage allows more scalability and security.

I would go for storing files directly on the disk, with the database holding only their ID/URL.
This way, accessing those files (which can be large, binary files) doesn't require any PHP/database operation, and it's done by the webserver directly.
It will also be easier to move those files to another server if you want to.
Actually, the only upside I can see at the moment of storing them in the database is easier backup: you want to back up your DB anyway, and this way you'll have all data in one place and can be sure that each backup is complete (i.e. you don't have files on disk that aren't used by database entries, and you don't have image IDs in your database that point to nowhere).

I asked a similar question using Oracle as the backend for a Windows Forms application.
The answer really boils down to your requirements for backing up and restoring the files. If that requirement is important then use the database as it'll be easier (as you're backing up the database anyway, right? :o)
