Storage Performance - php

I want to enable users to upload images & videos to a website.
The question now is: should I just drop all the files in one folder, or make a folder for each user, for example?
(Of course that would make files easier to find.)
Does it make a difference in performance?
Is there any difference in the access rate?
thanks

If you're dealing with many files, it's common practice to distribute them across multiple (sub-)directories. If a directory contains too many files and subdirectories, the file system has to do more work, so spreading the files out helps.
But this always depends on the underlying file system you use as your data store. You need to check which one you use, then look at the features it supports and the limits it imposes.
On the application layer, you should model file access and handling behind an abstraction, so you can change how your application stores the files later on without rewriting your whole application.
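As a minimal sketch of that kind of abstraction in PHP (the class name, base path and two-level hash split are illustrative assumptions, not something from the question):
    <?php
    // Hypothetical storage helper: the rest of the application only ever asks it
    // for a path, so the distribution scheme can be changed later in one place.
    class UploadStorage
    {
        private $baseDir;

        public function __construct($baseDir)
        {
            $this->baseDir = rtrim($baseDir, '/');
        }

        // Derive a two-level subdirectory from a hash of the filename so uploads
        // spread evenly across directories instead of piling up in one folder.
        public function pathFor($filename)
        {
            $hash = md5($filename);
            $dir  = $this->baseDir . '/' . substr($hash, 0, 2) . '/' . substr($hash, 2, 2);
            if (!is_dir($dir)) {
                mkdir($dir, 0755, true);
            }
            return $dir . '/' . basename($filename);
        }
    }

    // Usage when handling an upload:
    $storage = new UploadStorage('/var/www/uploads');
    move_uploaded_file($_FILES['image']['tmp_name'], $storage->pathFor($_FILES['image']['name']));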

Related

Storing images on Linux

I'm building a website with billions of images. I'm a little confused about storing images in a single directory. How many images can be stored in a single directory, and will it slow down the server?
Have you considered object storage such as AWS S3? http://aws.amazon.com/s3/
As for performance, I think it depends on the file system you intend to use. Some file systems index directory contents in a linear manner, others use more efficient algorithms. It also depends on whether any of your system services will need to scan the directory regularly.
I found these slides: http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler.pdf
and this question: https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory
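If you go the S3 route suggested above, a rough sketch with the AWS SDK for PHP (v3) could look like this; the region, bucket name and file path are placeholders:
    <?php
    require 'vendor/autoload.php';

    use Aws\S3\S3Client;

    // Placeholder region and bucket; the client picks up credentials from the
    // environment or your AWS credentials file.
    $s3 = new S3Client([
        'version' => 'latest',
        'region'  => 'us-east-1',
    ]);

    $s3->putObject([
        'Bucket'     => 'my-image-bucket',
        'Key'        => 'images/' . basename('/path/to/photo.jpg'),
        'SourceFile' => '/path/to/photo.jpg',
    ]);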

Preventing directory scanning from Acunetix

I have a PHP enabled site, with directory-listing turned off.
But when I used Acunetix (web vulnerability scanning software) to scan my site and other high-profile websites, it was able to list all directories & files.
I don't know why this is happening, but I have this theory: maybe the software is using English words, trying to see if a folder exists by trying names like "include/", "css/", "/images", etc. Then, maybe it is able to list files that way.
Because, if directory listing is off, I don't know what more there is to do.
So, I devised this plan, that if I give my folders/files difficult names like I3Nc_lude, 11css11, etc., maybe it would be difficult for the software to find the names. What do you think?
I know, I could be dead-wrong about this, and the idea might be laughable but, that is why I am asking for help.
How do you completely forbid directory listing?
Ensure all directories from the root of your site have directory listings disabled. It is typically on by default when you set up a new server.
Assuming that directory listing in your webserver is not your issue, keep in mind that any resources you have in your site (CSS files, JS sources, and of course hrefs) can be traversed with little or no effort (typically a few lines of JavaScript). There is no way to hide anything that you've referenced. This is most likely what you are seeing reflected in the scan.
Alternatively, if you use SVN or other version control systems to deploy your site, these can often be used to determine the path of every file in your codebase.
Probably the most common mistake people make when first creating sites is that they keep all their files in the webroot, and it becomes somewhat trivial to figure out where things are.
IMHO the best approach is to have your code in a separate directory outside the webroot, and then load it as needed (this is how most MVC frameworks work). You can then control entirely what can and cannot be accessed via the web. You can have hundreds of classes in a directory, and as long as they are not in the webroot, no one will ever be able to see them, even if directory listing were to become enabled.
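A minimal sketch of that layout, assuming a hypothetical ../app directory and router class that are not part of the answer above:
    <?php
    // public_html/index.php - the only PHP file that lives inside the webroot.
    // Everything it loads sits one level up and can never be requested directly,
    // so nothing is revealed even if directory listing gets switched on.
    require __DIR__ . '/../app/bootstrap.php';

    $router = new App\Router();                    // hypothetical class defined under ../app
    echo $router->dispatch($_SERVER['REQUEST_URI']);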
The scanners aren't using some kind of language-based brute-force attack; that would be far too costly and invasive even for the most inept hacker. Your web server (Apache, IIS, whatever) is serving up the structure to anyone who asks.
I found this solution at the link below - it should apply to you, I hope.
http://www.velvetblues.com/web-development-blog/dont-get-hacked-6-ways-to-secure-your-wordpress-blog/
Hide Your Directory Structure
It is also good practice to hide your directory structure. By default, many WordPress installations enable any visitors to snoop and see all files in folders lacking an index file. And while this might not seem dangerous, it really is. By enabling visitors to see what files are in each directory, they can better plot their attack.
To fix this problem, you can do one of two things:
Option 1: Use An Index File
For each directory that you want to protect, simply add an index file. A simple index.html file will suffice.
Option 2: Use An .htaccess File
The preferred way of hiding the directory structure is to use the following code in an .htaccess file.
Options -indexes
That just sounds like a nightmare to manage. Focus on securing the files as best you can with all preventative measures. Don't rely on security through obscurity; if someone wants in, some random directory names will only slow them down slightly.

Will storing too many files in one folder make an HTTP request for one of them slow?

I've got nearly a million images for my site, and they are stored in one folder on my Windows server.
Since opening this folder directly on the desktop drives me and my CPU crazy, I am wondering whether fetching one of them with my PHP script for an HTTP request is also laborious. So, will separating them into different folders improve the performance?
No, the performance does not depend on the number of files that are in a directory. The reason why opening the folder in Windows explorer is slow is because it has to render icons and various other GUI related things for each file.
When the web server fetches a file, it doesn't need to do that. It just (more or less) directly goes to the location of the file on the disk.
EDIT: Millions is kind of pushing the limits of your file system (I assume NTFS in your case). It appears that anything over 10,000 files in a directory starts to degrade your performance. So not only from a performance standpoint, but from an organizational standpoint as well, you may want to consider separating them into subdirectories.
Often the best answer in a case like this is to benchmark it. It shouldn't be too hard to create a program that opens 1000 hard-coded file names and closes them. Run the test on your million-plus directory and another directory containing only those 1000 files being tested and see if there's a difference.
Not only does the underlying file system make a difference, but accessing over a network can affect the results too.
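A rough sketch of such a benchmark in PHP, assuming you have a list of 1000 known file names in filenames.txt and a small test directory containing copies of those same files (both paths are placeholders):
    <?php
    // Time how long it takes to open and close the same 1000 files, once in the
    // million-file directory and once in a directory holding only those files.
    $names = file('filenames.txt', FILE_IGNORE_NEW_LINES);

    function timeOpens($dir, array $names)
    {
        $start = microtime(true);
        foreach ($names as $name) {
            $fh = fopen($dir . '/' . $name, 'rb');
            if ($fh !== false) {
                fclose($fh);
            }
        }
        return microtime(true) - $start;
    }

    printf("million-file dir: %.4f s\n", timeOpens('D:/uploads', $names));
    printf("1000-file dir:    %.4f s\n", timeOpens('D:/uploads_sample', $names));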
Separating your files into separate directories will most likely help performance, but as Mark suggests, it's probably worth benchmarking.

Many files in one directory?

I'm developing a PHP project on the Linux platform. Are there any disadvantages to putting several thousand images (files) in one directory? This is a closed set which won't grow. The alternative would be to separate these files using a directory structure based on some ID (that way there would be, let's say, only 100 in one directory).
I ask this question because I often see such separation when I look at image URLs on different sites. You can see that the directory separation is done in such a way that no more than several hundred images are in one directory.
What would I gain by not putting several thousand files (of a non-growing set) in one directory but separating them into groups of, e.g., 100? Is it worth complicating things?
UPDATE:
There won't be any programmatic iteration over files in a directory (just direct access to an image by its filename)
I want to emphasize that the image set is closed. It's less than 5000 images, and that is it.
There is no logical categorization of these images
Human access/browsing is not required
Images have unique filenames
OS: Debian/Linux 2.6.26-2-686, Filesystem: ext3
VALUABLE INFORMATION FROM THE ANSWERS:
Why separate many files into different directories:
"32k files limit per directory when using ext3 over NFS"
performance reasons (access speed) [but for several thousand files it is hard to say whether it's worth it without measuring]
In addition to faster file access, separating images into subdirectories also dramatically extends the number of files you can track before hitting the natural limits of the filesystem.
A simple approach is to md5() the file name, then use the first n characters as the directory name (e.g. substr(md5($filename), 0, 2)). This ensures a reasonably even distribution (versus taking the first n characters of the filename itself).
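A short sketch of that scheme in PHP; the base path is a placeholder, and with a two-character prefix (256 buckets) several thousand images end up as only a few dozen files per directory:
    <?php
    // Map a filename to a hashed subdirectory: the first two hex characters of
    // md5() give 256 evenly used buckets regardless of how the names look.
    function imagePath($filename, $base = '/var/www/images')
    {
        $bucket = substr(md5($filename), 0, 2);   // e.g. "7f"
        return $base . '/' . $bucket . '/' . $filename;
    }

    echo imagePath('photo_0042.jpg');   // something like /var/www/images/7f/photo_0042.jpg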
Usually the reason for such splitting is file system performance.
For a closed set of 5000 files, I am not sure it's worth the hassle.
I suggest that you try the simple approach of putting all the files in one directory, but keep an eye on the actual time it takes to access the files.
If you see that it's not fast enough for your needs, you can split it like you suggested.
I had to split files myself for performance reasons.
In addition, I bumped into a 32k-files-per-directory limit when using ext3 over NFS (not sure if it's a limit of NFS or ext3).
So that's another reason to split into multiple directories.
In any case, try a single dir first and only split if you see it's not fast enough.
There is no reason to split those files into multiple directories if you don't expect any filename conflicts and if you don't need to iterate over those images at any point.
But still, if you can think of a suggestive categorization, it's not a bad idea to sort the images a bit, even if it is just for maintenance reasons.
I think there are two aspects to this question:
Does the Linux file system that you're using efficiently support directories with thousands of files? I'm not an expert, but I think the newer file systems won't have problems.
Are there performance issues with specific PHP functions? I think direct access to files should be okay, but if you're doing directory listings, you might eventually run into time or memory problems.
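If you ever do need to list a large directory from PHP, iterating instead of reading the whole listing at once keeps memory use flat; a small sketch (the path is a placeholder):
    <?php
    // scandir() would build the entire listing as one array; DirectoryIterator
    // walks the entries one at a time, which scales better for huge directories.
    foreach (new DirectoryIterator('/var/www/images') as $entry) {
        if ($entry->isFile()) {
            // process $entry->getFilename() here
        }
    }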
The only situation I could imagine where it would be detrimental is when iterating over the directory: more files means more iterations. But that's basically all I can think of from a programming perspective.
Several thousand images are still okay. When you access a directory, the operating system reads its file listing in blocks of 4K. With a flat directory structure, it may take time to read the whole file listing if there are many files in it (e.g. hundreds of thousands).
If changing the filesystem is an option, I'd recommend moving wherever you store all the images to a ReiserFS filesystem. It is excellent at fast storage/access of lots of small files.
If not, MightyE's response of breaking them into folders is most logical and will improve access times by a considerable margin.

One code, many websites

I need to develop a project that would allow me to run many copies of a website, but each copy needs to be a separate website. I could upload the same code to many different accounts, but I would prefer to have only one copy of the code. Each website would be an "instance", so to speak. This way I could upload the code once and update all the websites at the same time.
For technical reasons I need to use PHP (but I'm interested in the other options too, for my own knowledge), and I thought Jelix could be a good choice of framework. Are there better options out there?
You can have all code in one directory, and then create virtual subdirectories in all your web sites, which all point to this directory. This is how Microsoft solves the problem in SharePoint.
The easiest bet is to have all the websites link to one server (perhaps distributed).
Pass the calling URL through your webserver to your application and use it to generate configuration information; those passed URLs define the differences between each site (see the sketch after this answer).
Beyond that, the framework is almost immaterial to the question, so I'll leave it to someone else to answer.
Just remember, if you make 20 copies of the same code, that's 20x the time it'll take to fix bugs.
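A minimal sketch of that host-based configuration idea in PHP; the config directory layout and file names are assumptions for illustration:
    <?php
    // One shared codebase serving many sites: choose a config file based on the
    // host name of the incoming request, falling back to a default.
    $host       = isset($_SERVER['HTTP_HOST']) ? $_SERVER['HTTP_HOST'] : 'default';
    $configFile = __DIR__ . '/config/' . basename($host) . '.php';

    if (!is_file($configFile)) {
        $configFile = __DIR__ . '/config/default.php';
    }

    $config = require $configFile;   // e.g. returns array('db' => ..., 'theme' => ...)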
If you're using UNIX or Linux for a web server, you could create one master copy of the PHP code, and then use symbolic links to the actual files that are in separate directories with virtual websites set up in Apache. You could also put site-specific config files under those directories, but the bulk of the PHP code would be resolved as symbolic links to the "master" code.
I'm not sure what kind of websites you're talking about, but why not use an already developed application like WordPress or any other CMS? The code is identical on every website, and you can easily update it. The website-specific data is only present in the single configuration file and the MySQL database.
