I'm building a website with billions of images, and I'm a little confused about storing them all in a single directory. How many images can be stored in a single directory, and will it slow down the server?
Have you considered object storage such as AWS S3? http://aws.amazon.com/s3/
As for performance, I think it depends on the file system you intend to use. Some file systems index directory contents linearly, while others use more efficient structures. It also depends on whether any of your system services will need to scan the directory regularly.
I found this: http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler.pdf
In this question: https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory
I'm looking for a solution to an application need: a web-based file manager/explorer that works with Amazon S3 buckets. The problem with most potential solutions I have found is that they rely on S3 itself to maintain the directory hierarchy, which is bad because it means additional latency when traversing folders (and listing their contents).
What I would like is a PHP app/class that maintains the directory structure (and filenames) in a database, so that listing/traversing files and directories is quick and S3 is only accessed when actually downloading or uploading a file.
Does anyone know of anything like this? I'm hoping there is something already in existence rather than taking the time to build from scratch.
Thanks!
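For illustration, a minimal sketch of that idea, assuming a hypothetical s3_objects MySQL table queried with PDO (all table, column and function names here are made up):

```php
<?php
// Hypothetical schema:
//   CREATE TABLE s3_objects (
//       id     INT AUTO_INCREMENT PRIMARY KEY,
//       dir    VARCHAR(500) NOT NULL,  -- virtual folder, e.g. "photos/2011"
//       name   VARCHAR(255) NOT NULL,  -- filename, e.g. "cat.jpg"
//       s3_key VARCHAR(700) NOT NULL   -- actual S3 object key
//   );

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Listing a "folder" only hits the database, never S3.
function listDirectory(PDO $pdo, $dir)
{
    $stmt = $pdo->prepare('SELECT name FROM s3_objects WHERE dir = ? ORDER BY name');
    $stmt->execute(array($dir));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// S3 itself is only contacted when a file is actually uploaded or
// downloaded, using the stored s3_key for that row.
```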
I'd definitely recommend using Gaufrette.
It abstracts away the filesystem, so you're able to switch between local storage, FTP, SFTP, S3 etc simply by switching the Adapter
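For example, a rough sketch of switching between a local adapter and an S3 adapter (adapter constructor arguments vary by Gaufrette version, and $s3Client / $imageBytes are placeholders):

```php
<?php
use Gaufrette\Filesystem;
use Gaufrette\Adapter\Local as LocalAdapter;
// use Gaufrette\Adapter\AwsS3 as S3Adapter;  // swap in when moving to S3

// Local storage today...
$adapter = new LocalAdapter('/var/www/uploads');

// ...S3 later: only the adapter changes, the calling code stays the same.
// $adapter = new S3Adapter($s3Client, 'my-bucket');

$filesystem = new Filesystem($adapter);
$filesystem->write('images/photo123.jpg', $imageBytes, true);  // true = allow overwrite
$data = $filesystem->read('images/photo123.jpg');
```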
I've got nearly a million images for my site, and they are stored in one folder on my Windows server.
Since opening this folder directly on the desktop drives me and my CPU crazy, I am wondering whether fetching one of them with my PHP script for an HTTP request is also laborious. So, will separating them into different folders improve performance?
No, the performance does not depend on the number of files that are in a directory. The reason why opening the folder in Windows explorer is slow is because it has to render icons and various other GUI related things for each file.
When the web server fetches a file, it doesn't need to do that. It just (more or less) directly goes to the location of the file on the disk.
EDIT: Millions is kind of pushing the limits of your file system (I assume NTFS in your case). It appears that anything over 10,000 files in a directory starts to degrade your performance. So not only from a performance standpoint, but from an organizational standpoint as well, you may want to consider separating them into subdirectories.
Often the best answer in a case like this is to benchmark it. It shouldn't be too hard to create a program that opens 1000 hard-coded file names and closes them. Run the test on your million-plus directory and another directory containing only those 1000 files being tested and see if there's a difference.
Not only does the underlying file system make a difference, but accessing over a network can affect the results too.
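A rough PHP benchmark along those lines might look like this (paths, filenames and counts are placeholders; point them at files that actually exist in both directories):

```php
<?php
// The same 1000 file names, present both in the huge directory and in a
// small directory that contains only these files.
$names = array();
for ($i = 1; $i <= 1000; $i++) {
    $names[] = sprintf('img_%04d.jpg', $i);
}

function timeOpens($dir, array $names)
{
    $start = microtime(true);
    foreach ($names as $name) {
        $fh = fopen($dir . '/' . $name, 'rb');
        if ($fh !== false) {
            fclose($fh);
        }
    }
    return microtime(true) - $start;
}

printf("huge dir:  %.4f s\n", timeOpens('D:/images_all', $names));
printf("small dir: %.4f s\n", timeOpens('D:/images_sample', $names));
```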
Separating your files into separate directories will most likely help performance, but as Mark suggests, it's probably worth benchmarking.
I want to enable users to upload images & videos to a website.
The question now is: should I just drop all the files in one folder, or make a folder for each user, for example?
(Of course it would be easier to find)
Does it make a difference in the performance?
Is there any difference in the access rate?
thanks
If you're dealing with many files, it's a common principle to distribute them across multiple (sub)directories. This is because a directory containing too many directories and files means more work for the file system, and distributing the files helps with that.
But this always depends on the underlying file system you use for storage. You need to check which one you are using, which features it supports, and which limits apply.
On the application layer, you should model file access and handling so that you can later change how your application stores the files without rewriting the whole application.
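A minimal sketch of such an abstraction layer (the interface and class names are made up for illustration):

```php
<?php
// The application only talks to this interface; how and where files are
// actually stored can change later without touching the calling code.
interface MediaStorage
{
    public function save($key, $data);
    public function fetch($key);
    public function url($key);
}

class LocalDirectoryStorage implements MediaStorage
{
    private $baseDir;

    public function __construct($baseDir)
    {
        $this->baseDir = rtrim($baseDir, '/');
    }

    // Spread files over subdirectories so no single directory grows too large.
    private function path($key)
    {
        return $this->baseDir . '/' . substr(md5($key), 0, 2) . '/' . $key;
    }

    public function save($key, $data)
    {
        $path = $this->path($key);
        if (!is_dir(dirname($path))) {
            mkdir(dirname($path), 0755, true);
        }
        file_put_contents($path, $data);
    }

    public function fetch($key)
    {
        return file_get_contents($this->path($key));
    }

    public function url($key)
    {
        return '/media/' . substr(md5($key), 0, 2) . '/' . $key;
    }
}
```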
I'm building a webapp that, as a small subset of one of its features, allows images to be uploaded. I'm running a LAMP stack, with MongoDB instead of MySQL.
I use a javascript uploader with a php backend to upload the files. The whole framework is under version control though, so I don't want to dump these files anywhere inside my framework, as it would get messy with the version control, and sub-optimal when I eventually migrate the media over to a CDN.
So, my question is - On a VPS, where should I drop these images for now? In some folder external to my framework? In my DB as bson? I've heard Mongo does a decent job handling binary data...
And, as a follow up, if I'm eventually planning on moving the content over to a CDN, how would you recommend structuring my schema for now?
My current plan would be something like the following:
All uploads are named with a unique ID and dropped in an external folder, defined by a globally accessible variable of sorts.
A reference to each image's name is stored in the DB.
Is there anything obviously stupid about going about it that way, possibly causing me a scaling headache later?
Here's a summarized specific question, just so this is a little more of an SO friendly question:
Given a Linux, Apache, Mongo, PHP Framework on a VPS, what is the best way to store uploaded images while keeping scalability and speed as the 2 most important factors in deciding on the solution?
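For illustration, a rough sketch of the plan described in the question (unique ID as the filename, file dropped in an external folder, a reference stored in MongoDB via the mongodb/mongodb library; all paths and names are placeholders):

```php
<?php
// Hypothetical config value pointing outside the framework / document root.
$uploadDir = '/var/media/uploads';

// 1. Name the upload with a unique ID and move it to the external folder.
$ext      = strtolower(pathinfo($_FILES['image']['name'], PATHINFO_EXTENSION));
$filename = uniqid('', true) . '.' . $ext;
move_uploaded_file($_FILES['image']['tmp_name'], $uploadDir . '/' . $filename);

// 2. Store only a reference in MongoDB; moving to a CDN later just means
//    changing how the public URL is built from this record.
$mongo = new MongoDB\Client('mongodb://localhost:27017');
$mongo->myapp->images->insertOne(array(
    'filename'   => $filename,
    'original'   => $_FILES['image']['name'],
    'uploadedAt' => new MongoDB\BSON\UTCDateTime(),
));
```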
If your plan is to move to a CDN, the answer couldn't be easier: create a subdomain on your VPS and drop your images there. You will have a decent CDN simulation as well as reliable file storage.
I agree that you should never put user-uploaded content inside your webapp, for a number of reasons, but then one of the problems is how to write the img tag in HTML, whose src attribute is relative to the webapp.
An easy workaround: create a folder such as /usr/ImagesUPloadByUser on your Unix box and drop all the images there, then create a link (Linux symlink) in your webapp which points to that directory. Now the images are not residing in your webapp, and you still have easy access to them.
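If you prefer, that link can also be created once from PHP (paths here are just examples, and Apache must allow FollowSymLinks for the images to be served this way):

```php
<?php
// Point "uploads" inside the web root at the external image folder.
// Equivalent to: ln -s /usr/ImagesUPloadByUser /var/www/myapp/public/uploads
if (!file_exists('/var/www/myapp/public/uploads')) {
    symlink('/usr/ImagesUPloadByUser', '/var/www/myapp/public/uploads');
}

// An <img> tag can then use a path relative to the webapp:
// <img src="/uploads/photo123.jpg" alt="...">
```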
I develop a PHP project on the Linux platform. Are there any disadvantages to putting several thousand images (files) in one directory? This is a closed set which won't grow. The alternative would be to separate these files using a directory structure based on some ID (that way there would be, say, only 100 in one directory).
I ask this question because I often see such separation when I look at image URLs on different sites. You can see that the directory separation is done in such a way that no more than a few hundred images are in one directory.
What would I gain by not putting several thousand files (of a non-growing set) in one directory but separating them into groups of e.g. 100? Is it worth complicating things?
UPDATE:
There won't be any programmatic iteration over files in a directory (just direct access to an image by its filename)
I want to emphasize that the image set is closed. It's less than 5000 images, and that is it.
There is no logical categorization of these images
Human access/browse is not required
Images have unique filenames
OS: Debian/Linux 2.6.26-2-686, Filesystem: ext3
VALUABLE INFORMATION FROM THE ANSWERS:
Why separate many files to different directories:
"32k files limit per directory when using ext3 over nfs"
performance reasons (access speed) [but for several thousand files it is difficult to say whether it's worth it without measuring]
In addition to faster file access by separating images into subdirectories, you also dramatically extend the number of files you can track before hitting the natural limits of the filesystem.
A simple approach is to md5() the file name, then use the first n characters as the directory name (e.g., substr(md5($filename), 0, 2)). This ensures a reasonably even distribution (vs. taking the first n characters of the filename itself).
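A short sketch of that scheme (the prefix length and directory depth are arbitrary choices here, and $imageBytes is a placeholder for the uploaded data):

```php
<?php
// Map a filename to a bucketed path such as /var/media/images/a3/f1/photo123.jpg
function bucketedPath($baseDir, $filename)
{
    $hash = md5($filename);
    // Two levels of 2-character prefixes => 256 * 256 = 65,536 buckets.
    return $baseDir . '/' . substr($hash, 0, 2) . '/' . substr($hash, 2, 2) . '/' . $filename;
}

$path = bucketedPath('/var/media/images', 'photo123.jpg');
if (!is_dir(dirname($path))) {
    mkdir(dirname($path), 0755, true);  // create the bucket directories on demand
}
file_put_contents($path, $imageBytes);
```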
Usually the reason for such splitting is file system performance.
For a closed set of 5000 files, I am not sure it's worth the hassle.
I suggest that you try the simple approach of putting all the files in one directory, but keep an eye on the actual time it takes to access the files.
If you see that it's not fast enough for your needs, you can split them up as you suggested.
I had to split files myself for performance reasons.
In addition, I bumped into a 32k files-per-directory limit when using ext3 over NFS (not sure if it's a limit of NFS or ext3).
So that's another reason to split into multiple directories.
In any case, try with a single directory and only split if you see it's not fast enough.
There is no reason to split those files into multiple directories, if you won't expect any filename conflicts and if you don't need to iterate over those images at any point.
But still, if you can think of a natural categorization, it's not a bad idea to sort the images a bit, even if it is just for maintenance reasons.
I think there are two aspects to this question:
Does the Linux file system that you're using efficiently support directories with thousands of files? I'm not an expert, but I think the newer file systems won't have problems.
Are there performance issues with specific PHP functions? I think direct access to files should be okay, but if you're doing directory listings then you might eventually run into time or memory problems.
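For instance, scandir() loads every entry of a directory into one array, while an iterator touches entries one at a time, which matters for very large directories (a small illustrative sketch):

```php
<?php
$dir = '/var/media/images';  // placeholder path

// scandir() builds the complete file list in memory at once.
$all = scandir($dir);

// DirectoryIterator walks the directory entry by entry instead.
foreach (new DirectoryIterator($dir) as $entry) {
    if ($entry->isFile()) {
        // process $entry->getFilename() ...
    }
}
```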
The only reason I could imagine it being detrimental is when iterating over the directory: more files means more iterations. But that's basically all I can think of from a programming perspective.
Several thousand images are still okay. When you access a directory, the operating system reads the listing of its files in blocks of 4 KB. With a flat directory structure, it may take time to read the whole file listing if there are many (e.g. a hundred thousand) files in it.
If changing the filesystem is an option, I'd recommend moving wherever you store all the images to a ReiserFS filesystem. It is excellent at fast storage/access of lots of small files.
If not, MightyE's suggestion of breaking them into folders is the most logical and will improve access times by a considerable margin.