I'm developing a PHP project on a Linux platform. Are there any disadvantages to putting several thousand images (files) in one directory? This is a closed set which won't grow. The alternative would be to separate these files using a directory structure based on some ID (that way there would be, let's say, only 100 per directory).
I ask because I often see this kind of separation when I look at image URLs on different sites. You can see that the directories are split in such a way that no more than a few hundred images end up in any one directory.
What would I gain by not putting several thousand files (of a non-growing set) in one directory, but instead separating them into groups of, e.g., 100? Is it worth complicating things?
UPDATE:
There won't be any programmatic iteration over files in a directory (just direct access to an image by its filename)
I want to emphasize that the image set is closed. It's fewer than 5000 images, and that is it.
There is no logical categorization of these images
Human access/browsing is not required
Images have unique filenames
OS: Debian/Linux 2.6.26-2-686, Filesystem: ext3
VALUABLE INFORMATION FROM THE ANSWERS:
Why separate many files to different directories:
"32k files limit per directory when using ext3 over nfs"
performance reasons (access speed) [but for several thousand files it is difficult to say whether it's worth it without measuring]
In addition to faster file access by separating images into subdirectories, you also dramatically extend the number of files you can track before hitting the natural limits of the filesystem.
A simple approach is to md5() the file name, then use the first n characters as the directory name (e.g., substr(md5($filename), 0, 2) for the first two). This ensures a reasonably even distribution (vs taking the first n characters of the straight filename).
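For instance, a tiny sketch of that bucketing (the base path here is just a placeholder):

$hash   = md5($filename);                        // 32 hex characters, evenly distributed
$bucket = substr($hash, 0, 2);                   // first two characters: 256 possible buckets
$path   = '/var/www/images/' . $bucket . '/' . $filename;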
usually the reason for such splitting is file system performance.
for a closed set of 5000 files I am not sure it's worth the hassle.
I suggest that you try the simple approach of putting all the files in one directory, but keep an eye on the actual time it takes to access the files.
if you see that it's not fast enough for your needs, you can split it like you suggested.
I had to split files myself for performance reasons.
in addition I bumped into a 32k files limit per directory when using ext3 over nfs (not sure if it's a limit of nfs or ext3).
so that's another reason to split into multiple directories.
in any case, try with a single dir and only split if you see it's not fast enough.
There is no reason to split those files into multiple directories if you don't expect any filename conflicts and if you don't need to iterate over those images at any point.
But still, if you can think of a meaningful categorization, it's not a bad idea to sort the images a bit, even if it is just for maintenance reasons.
I think there are two aspects to this question:
Does the Linux file system that you're using efficiently support directories with thousands of files? I'm not an expert, but I think the newer file systems won't have problems.
Are there performance issues with specific PHP functions? I think direct access to files should be okay, but if you're doing directory listings then you might eventually run into time or memory problems.
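As a rough illustration of that difference (the paths are placeholders):

// Direct access: the filesystem resolves a single name; cheap even in a large directory
if (file_exists('/var/www/images/12345.jpg')) {
    readfile('/var/www/images/12345.jpg');
}

// Directory listing: PHP loads every entry into an array; time and memory grow with the file count
$entries = scandir('/var/www/images');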
The only case I can imagine where it would be detrimental is when iterating over the directory. More files means more iterations. But that's basically all I can think of from a programming perspective.
Several thousand images are still okay. When you access a directory, the operating system reads the listing of its files in 4K blocks. If you have a flat directory structure, it may take time to read the whole file listing if there are many (e.g., a hundred thousand) files in it.
If changing the filesystem is an option, I'd recommend moving wherever you store all the images to a ReiserFS filesystem. It is excellent at fast storage/access of lots of small files.
If not, MightyE's suggestion of breaking them into folders is most logical and will improve access times by a considerable margin.
Related
Even though there seem to be a few duplicate questions, I think this one is unique. I'm not asking whether there are any limits; it's only about performance drawbacks in the context of Apache, or the Unix file system in general.
Let's say I request a file from an Apache server
http://example.com/media/example.jpg
does it matter how many files there are in the same directory "media"?
The reason I'm asking is that my PHP application generates images on the fly.
Once an image is created, the script places it at the same location that would otherwise trigger the PHP script via mod_rewrite. If the file exists, Apache skips the whole PHP execution and serves the static image directly instead. It's a kind of gateway cache, if you want to call it that.
Apache has basically two things to do:
Check if the file exists
Serve the file or forward the request to PHP
So far, I have about 25,000 files totalling about 8 GB in this single directory. I expect it to grow at least tenfold in the next few years.
While I don't face any issues managing these files, I have the slight feeling that it keeps getting slower when requesting them via HTTP. So I wondered if this is really what happens or if it's just my subjective impression.
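For reference, the generate-once-then-serve-statically flow looks roughly like this; the generate_image() helper and the exact paths are placeholders, not my real code:

// Handler that mod_rewrite falls through to when the static file does not exist yet
$name = basename($_GET['img']);                 // e.g. "example.jpg"; sanitize properly in real code
$path = __DIR__ . '/media/' . $name;            // the same path Apache checks for a static file

if (!file_exists($path)) {
    $data = generate_image($name);              // hypothetical: render the image bytes
    file_put_contents($path, $data);            // the next request is served statically by Apache
}

header('Content-Type: image/jpeg');             // assuming JPEG output
readfile($path);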
Most file systems based on the Berkeley FFS will degrade in performance with large numbers of files in one directory due to multiple levels of indirection.
I don't know about other file systems like HFS or NTFS, but my suspicion is that they may well suffer from the same issue.
I once had to deal with a similar issue and ended up using a map for the files.
I think it was something like md5 myfilename-00001, yielding (for example) e5948ba174d28e80886a48336dcdf4a4, which I then stored in a file named e5/94/8ba174d28e80886a48336dcdf4a4. A map file then mapped 'myfilename-00001' to 'e5/94/8ba174d28e80886a48336dcdf4a4'. This not-quite-elegant solution worked for my purposes and it only took a little bit of code.
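A minimal PHP sketch of that mapping idea (the map-file name is an assumption):

// First two pairs of hex characters become directories, the rest is the file name
$hash       = md5($originalName);
$storedPath = substr($hash, 0, 2) . '/' . substr($hash, 2, 2) . '/' . substr($hash, 4);

// Remember which original name maps to which stored path (assumed map file "filemap.php")
$map = file_exists('filemap.php') ? include 'filemap.php' : array();
$map[$originalName] = $storedPath;
file_put_contents('filemap.php', '<?php return ' . var_export($map, true) . ';');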
I made a dynamic site that has over 20,000 pages, and once a page is created there is no need to update it for at least a month or even a year. So I'm caching every page when it is first created and then delivering it from a static HTML file.
I'm running a PHP script (the whole CMS is in PHP) that uses if (file_exists($filename)) to first look for the filename from the URL in the cache-files directory; if it's there, it delivers it, otherwise it generates the page and caches it for later use. Although the site is dynamic, my URLs don't contain ?, & or =; I use hyphens instead and break the URL into an array.
What I want to know is will it create any problem to search for a file from that huge directory?
I saw a few Q&As like this which say that there shouldn't be a problem with the number of files I can store in a directory on an ext2 or ext3 file system (I guess my server has ext3), but that the speed of creating a new file will decrease rapidly once there are more than 20-30,000 files.
Currently I'm on a shared host and I must cache files. My host has a soft limit of 100,000 files on my whole box, which is good enough so far.
Can someone please give me a better idea of how to cache the site?
You shouldn't place all of the 20K files in a single directory.
Divide them into directories (by letter, for example), so you access:
a/apple-pie-recipe
j/john-doe-for-presidency
etc.
That would allow you to place more files with fewer constraints on the file system, which would increase the speed (since the FS doesn't need to locate your file in a directory alongside 20k other files; it only needs to look through a few hundred).
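A minimal sketch of that letter-based split (the cache directory name is an assumption):

$slug   = 'apple-pie-recipe';                   // page name taken from the URL
$letter = substr($slug, 0, 1);                  // first letter becomes the subdirectory
$dir    = __DIR__ . '/cache-files/' . $letter;  // e.g. .../cache-files/a

if (!is_dir($dir)) {
    mkdir($dir, 0755, true);                    // create the bucket on first use
}
$cacheFile = $dir . '/' . $slug . '.html';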
there should not be problem with number of files I can store on directory with ext2 or ext3
That's rather an old document. There are two big differences between ext2 and ext3: journalling is one; the other is H-tree indexing of directories (which reduces the impact of storing lots of files in the same directory). While it's trivial to add journalling to an ext2 filesystem and mount it as ext3, that alone does not give the benefits of dir_index; enabling it requires a full fsck.
Regardless of the filesystem, using a nested directory structure makes the system a lot more manageable and scalable - and avoids performance problems on older filesystems.
(I've been doing three other things since I started writing this and see someone else has suggested something similar; however, Madara's approach doesn't give an evenly balanced tree. OTOH, having a semantic path may be more desirable.)
e.g.
define('GEN_BASE_PATH', '/var/data/cache-files');
define('GEN_LEVELS', 2);

function gen_file_path($id)
{
    $key = md5($id);          // hash the id so paths are evenly distributed
    $fname = '';
    for ($x = 0; $x < GEN_LEVELS; $x++) {
        $fname .= substr($key, 0, 1) . "/";   // peel off one character per directory level
        $key = substr($key, 1);
    }
    return GEN_BASE_PATH . "/" . $fname . $key;
}
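Calling gen_file_path('some-id') then returns a path like /var/data/cache-files/a/b/<remaining 30 hash characters>, where a and b stand for the first two characters of md5('some-id'); with two levels of 16 possible characters each, the files spread evenly across 256 subdirectories (remember to mkdir() the directories before writing).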
However, the real way to solve the problem would be to serve the content with the right headers and run a caching reverse proxy in front of the webserver (though this isn't really practical for a very low-volume site).
I've got nearly a million images for my site, and they are stored in one folder on my Windows server.
Since opening this folder directly on the desktop drives me and my CPU crazy, I am wondering whether fetching one of them with my PHP script for an HTTP request is also laborious. So, will separating them into different folders improve performance?
No, the performance does not depend on the number of files that are in a directory. The reason why opening the folder in Windows explorer is slow is because it has to render icons and various other GUI related things for each file.
When the web server fetches a file, it doesn't need to do that. It just (more or less) directly goes to the location of the file on the disk.
EDIT: A million files is kind of pushing the limits of your file system (I assume NTFS in your case). It appears that anything over 10,000 files in a directory starts to degrade performance. So not only from a performance standpoint, but from an organizational standpoint as well, you may want to consider separating them into subdirectories.
Often the best answer in a case like this is to benchmark it. It shouldn't be too hard to create a program that opens 1000 hard-coded file names and closes them. Run the test on your million-plus directory and another directory containing only those 1000 files being tested and see if there's a difference.
Not only does the underlying file system make a difference, but accessing over a network can affect the results too.
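A rough PHP sketch of such a benchmark (the directory paths and the file-name list are assumptions):

// Time how long it takes to open and close the same 1000 files in a given directory
function bench_open($dir, array $names)
{
    $start = microtime(true);
    foreach ($names as $name) {
        $h = fopen($dir . '/' . $name, 'r');
        if ($h !== false) {
            fclose($h);
        }
    }
    return microtime(true) - $start;
}

$names = file('filelist.txt', FILE_IGNORE_NEW_LINES);  // assumed list of 1000 known file names
echo bench_open('/path/to/big-directory', $names), "\n";
echo bench_open('/path/to/small-directory', $names), "\n";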
Separating your files into separate directories will most likely help performance, but as mark suggests, it's probably worth benchmarking.
I want to enable users to upload images & videos to a website.
The question now is: shall I just drop all the files in one folder, or make a folder for, e.g., each user?
(Of course it would be easier to find)
Does it make a difference in the performance?
Is there any difference in the access rate?
thanks
If you're dealing with many files, it's a common practice to distribute the files across multiple (sub)directories. This is because if a directory contains too many directories and files, the file system needs to do more work. Distributing the files helps in that case.
But this always depends on the underlying file system you use as a data store. You need to check which one you use, and then look at the features it supports and the limits it imposes.
On the application layer, you should model file access and handling so you can change how your application stores the files later on without rewriting your whole application.
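One way to keep that flexibility, as a rough sketch (the interface and class names are made up for illustration):

interface MediaStorage
{
    public function save($name, $data);
    public function path($name);
}

// Current strategy: one subdirectory per user; swap in another class later without touching callers
class PerUserStorage implements MediaStorage
{
    private $base;

    public function __construct($base) { $this->base = $base; }

    public function save($name, $data)
    {
        $dir = dirname($this->path($name));
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true);
        }
        file_put_contents($this->path($name), $data);
    }

    public function path($name)
    {
        return $this->base . '/' . $name;       // assumed name format "userid/filename"
    }
}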
I'm using PHP to make a simple caching system, but I'm going to be caching up to 10,000 files in one run of the script. At the moment I'm using a simple loop with
$file = "../cache/".$id.".htm";
$handle = fopen($file, 'w');
fwrite($handle, $temp);
fclose($handle);
($id being a random string which is assigned to a row in a database)
but it seems a little slow. Is there a better method of doing that? Also, I read somewhere that on some operating systems you can't store thousands and thousands of files in one single directory; is this relevant to CentOS or Debian? Bear in mind this folder may well end up having over a million small files in it.
Simple questions, I suppose, but I don't want to start scaling this code and then find out I'm doing it wrong. I'm only testing with caching 10-30 pages at a time at the moment.
Remember that in UNIX, everything is a file.
When you put that many files into a directory, something has to keep track of those files. If you do an:
ls -la
You'll probably note that the '.' has grown to some size. This is where all the info on your 10000 files is stored.
Every seek, and every write into that directory will involve parsing that large directory entry.
You should implement some kind of directory hashing system. This'll involve creating subdirectories under your target dir.
eg.
/somedir/a/b/c/yourfile.txt
/somedir/d/e/f/yourfile.txt
This'll keep the size of each directory entry quite small, and speed up IO operations.
The number of files you can effectively use in one directory depends not on the operating system but on the filesystem.
You can split your cache dir effectively by taking the md5 hash of the filename and using its first 1, 2 or 3 characters as a directory. Of course, you have to create the directory if it doesn't exist, and use the same approach when retrieving files from the cache.
For a few tens of thousands, 2 characters (256 subdirs from 00 to ff) would be enough.
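A minimal sketch of that, built on the loop from the question ($id, $temp and the ../cache/ path as in the question):

$prefix = substr(md5($id), 0, 2);               // "00" .. "ff": 256 buckets
$dir    = "../cache/" . $prefix;

if (!is_dir($dir)) {
    mkdir($dir, 0755, true);                    // create the bucket on first use
}

$file   = $dir . "/" . $id . ".htm";
$handle = fopen($file, 'w');
fwrite($handle, $temp);
fclose($handle);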
File I/O in general is relatively slow. If you are looping over thousands of files and writing them to disk, the slowness could be normal.
I would move that over to a nightly job if that's a viable option.
You may want to look at memcached as an alternative to filesystems. Using memory will give a huge performance boost.
http://php.net/memcache/
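If the Memcache extension is available, a minimal usage sketch might look like this (the server host/port and key scheme are assumptions):

$memcache = new Memcache;
$memcache->connect('localhost', 11211);         // assumed memcached server

$key  = 'page_' . $id;                          // $id as in the question
$page = $memcache->get($key);

if ($page === false) {
    $page = $temp;                              // freshly generated content
    $memcache->set($key, $page, 0, 3600);       // cache for an hour (flags = 0)
}

echo $page;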