Fast access to files - php

I'm currently building an application that will generate a large number of images (a few tens of thousands, possibly more, but not in the near future). I want to be able to determine whether a file exists and also send it to clients over HTTP (I'm using Apache as my web server).
What is the best way to do this? I thought about splitting the images across a few folders to reduce the number of files in each directory. For example, let's say I decide that each file name will begin with a lowercase letter of the alphabet. Then I create 26 directories, and when I want to look for a file I prepend the name of the directory. For example, if I want a file called "funnyimage2.jpg", I will save it inside a directory called "f". I can add layers to that structure if required.
To be honest I'm not even sure if just saving all the files in one directory isn't just as good, so if you could add an explanation as to why your solution is better it would be very helpful.
p.s
My application is written in PHP and I intend to use file_exists to check if a file exists or not.

Do it with a hash, such as md5 or sha1 and then use 2 characters for each segment of the path. If you go 4 levels deep you'll always be good:
f4/a7/b4/66/funnyimage.jpg
Oh, and the reason it's slow to dump everything in one directory is that many filesystems (especially older ones) don't store filenames in a B-tree or similar structure, so they often have to scan the entire directory to find a file.
The reason a hash is great is that it has really good distribution. 26 directories may not cut it, especially if lots of images have a filename like "image0001.jpg"
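A minimal sketch of the scheme above, assuming the hash is taken over the file name (hashing the file contents would work the same way). The function name hashedPath and the $levels parameter are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: builds a sharded relative path such as "5d/41/40/2a/hello".
function hashedPath(string $filename, int $levels = 4): string
{
    $hash = md5($filename);                                   // 32 hex characters
    $segments = str_split(substr($hash, 0, $levels * 2), 2);  // two characters per level
    return implode('/', $segments) . '/' . $filename;
}

// Usage (the base directory is an assumption):
// $exists = file_exists('/path/to/images/' . hashedPath('funnyimage.jpg'));
```

With two hex characters per level, no directory ever holds more than 256 subdirectories.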

Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Consequently, ext3 lacks recent features, such as extents, dynamic allocation of inodes, and block suballocation.[15] A directory can have at most 31998 subdirectories, because an inode can have at most 32000 links.[16]

A directory on a unix file system is just a file that lists filenames and what inode contains the actual file data. As such, scanning a directory for a particular filename boils down to the equivalent operation of opening a text file and scanning for a line with a particular piece of text.
At some point, the overhead of opening that directory "file" and scanning for your filename will outweigh the overhead of using multiple sub-directories. Generally, this won't happen until there are many thousands of files. You should benchmark your system/server to find where the crossover point is.
After that, it's a simple matter of deciding how to split your filenames into subdirectories. If you're allowing only alpha-numeric characters, then maybe a split based on the first 2 characters (1,296 possible subdirs) might make more sense than a single dir with 10,000 files.
Of course, for every additional level of splitting you add, you're forcing the system to open yet another directory "file" and scan for your filename, so don't go too deep on the splits.
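The two-character split described above could be sketched like this. It assumes names are compared case-insensitively (so the 1,296 figure holds) and pads very short names with an underscore; prefixPath is a hypothetical name.

```php
<?php
// Hypothetical helper: shard by the first two alphanumeric characters of the name.
function prefixPath(string $filename): string
{
    // Drop anything that is not a letter or digit, then lowercase (36 symbols).
    $clean = strtolower(preg_replace('/[^A-Za-z0-9]/', '', $filename));
    // Pad so that one-character names still get a two-character directory.
    $prefix = str_pad(substr($clean, 0, 2), 2, '_');
    return $prefix . '/' . $filename;
}
```

This keeps the path to a single extra directory level, at the cost of an uneven spread when many names share a prefix.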

Your setup is okay. Keep going this way.

It seems that you are on the right path. Another post at ServerFault seems to confirm that you are doing the right thing.

I think Linux has a limit on the number of files a directory can contain; it might be best to split them up.
With your method, you can have the same exact image with many different file names. Also, you'll have more images that start with "t" than with "q", so some directories would still get large. You might want to store them as MD5-HASH.jpg instead, hashing the file contents. This will eliminate duplicates and give a more even distribution; splitting on the first character of the hash yields 16 directories, one per hex digit.
Edit: Like Evert mentions, you can do a multi-level directory structure to keep the directory size even smaller.
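A small sketch of content-based naming, assuming the hash is computed over the image bytes. contentName is a hypothetical helper; for a file already on disk, md5_file($path) returns the same digest.

```php
<?php
// Hypothetical helper: name a file after the MD5 of its contents,
// so identical images always map to the same name.
function contentName(string $bytes, string $extension): string
{
    return md5($bytes) . '.' . strtolower($extension);
}

// Usage (the upload path is an assumption):
// $name = contentName(file_get_contents($uploadedPath), 'jpg');
```

Two uploads with identical bytes produce the same name, so the second one can simply be skipped.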

Related

Using glob() to fetch filenames without "v[0-9]"

I know how to use glob() to fetch all image files in a directory, but I want to save retrieval time and only fetch the ones I need in the first place.
I am building a car dealership website, and there is a directory where all the vehicle photos get stored. Photos that are associated with a vehicle for sale start with the letter "v" and then the database ID, and then a dot before the model of vehicle.
Here is a sample list of files in a directory:
v313.2014.toyota.camry.0.jpg
v313.2014.toyota.camry.1.jpg
fordfusion.jpg
fordfusion2.jpg
v87.2015.honda.civic.0.jpg
v87.2015.honda.civic.1.jpg
2014.ford.escape.0.jpg
2014.ford.escape.1.jpg
Out of those files, only fordfusion.jpg, fordfusion2.jpg, 2014.ford.escape.0.jpg, 2014.ford.escape.1.jpg should be returned by glob().
I hope this is possible without retrieving all the image files and then going through the array with a regex because 90% of the images being fetched wouldn't be necessary.
I hope this is possible without retrieving all the image files and then going through the array with a regex because 90% of the images being fetched wouldn't be necessary.
Unless there is an extremely large number of files in the directory, this isn't worth worrying about. glob() internally has to iterate through all files in the folder to check their names against the pattern anyway; doing it in PHP code with a regular expression will perform equally well.
If there really is a very large number of files in the directory… don't do that. Large directories perform very poorly in general, and some filesystems limit the number of entries in a folder. (For instance, the ext3 file system, common on older Linux systems, allows just under 32,000 subdirectories in a single directory.) Split them up into multiple directories.
To answer the question directly, though, there is no way to do this with a glob() pattern. It's possible to match all the files that do have names starting that way, but there's no way to invert the match. (You could check for [^v]* and v[^0-9]* as two separate patterns, but there's no way to combine them into a single pattern.)
If you're confident that the files you don't want to retrieve start with the letter v followed by a digit, you can filter the names with the following regex, which matches the ones to keep:
^[^v].*$|^v[^\d].*
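Filtering in PHP, as suggested above, might look like the following sketch. It assumes the unwanted names always start with "v" followed by a digit; nonVehiclePhotos is a hypothetical name, while preg_grep() and PREG_GREP_INVERT are standard PHP.

```php
<?php
// Hypothetical helper: keep only names that do NOT start with "v" plus a digit.
function nonVehiclePhotos(array $names): array
{
    $kept = preg_grep('/^v\d/', $names, PREG_GREP_INVERT);
    return array_values($kept);   // reindex, since preg_grep preserves keys
}

// Usage against a real directory (the path is an assumption):
// $files  = glob('/path/to/photos/*.jpg') ?: [];
// $photos = nonVehiclePhotos(array_map('basename', $files));
```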

Store images in directories and store a reference to the images in the database

I am currently involved in a project to create a website which allows users to share and rate images, for their creative media course.
I am trying to find ways to save images to a MySQL database. I know I can save images as blobs, but this won't work as I plan on only allowing users to save high-res images. Therefore, I've tried to find out how to store images in a directory/server folder and store references to the images in the database. An added complication to the matter is that the reference must be saved automatically in a MySQL database table.
Does anyone know how to go about this, or could you point me in the right direction?
Thanks!
I've actually built a similar website (mass image uploader) so I can speak from experience.
Keeping track of the files
Save the image file as-is on disk and save the path to the file in the database. This part should be pretty straightforward.
One disadvantage is that you need a database lookup for every image, but if your table is well optimized (indexes) this should be no real problem.
There are many advantages, such as your files become easily referable and you can add meta data to your files (like number of views).
Filenames
Now, saving files, lots of files, is not immediately straightforward.
If you don't care at all about filenames just generate a random hash like:
$filename = md5(uniqid()); // generate a random hash, mileage may vary
This gets rid of all kinds of filename-related issues like duplicate filenames, unsupported characters etc.
If you want to preserve the filename, store the filename in the database.
If you want your filename on disk to also be somewhat human readable I would go for a mixed approach: partly hash, partly original filename. You will need to filter unsupported characters (like /), and perhaps transliterate similar characters (like é -> e and ß -> ss). Foreign languages such as Chinese and Hebrew can give interesting results, so be aware of that. You could also encode any foreign character (like base64_encode) but that doesn't do much for readability.
Finally, be aware of filepath length constraints. Filenames and filepaths cannot be infinitely long. Windows traditionally limits the full path to 260 characters, and most filesystems limit a single filename to 255.
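The mixed approach (part hash, part original name) could be sketched as follows. This is only an illustration: safeStoredName, the 8-character hash prefix and the 50-character cap are assumptions, and it replaces unsupported characters instead of transliterating them.

```php
<?php
// Hypothetical helper: "<hash prefix>_<sanitized original name>.<ext>".
function safeStoredName(string $original, string $uniqueHash): string
{
    // Split off the extension at the last dot, if there is one.
    $dot  = strrpos($original, '.');
    $base = $dot === false ? $original : substr($original, 0, $dot);
    $ext  = $dot === false ? '' : strtolower(substr($original, $dot + 1));
    $ext  = preg_replace('/[^a-z0-9]/', '', $ext);

    // Replace every run of unsupported characters (including "/") with "-".
    $slug = trim(preg_replace('/[^A-Za-z0-9_-]+/', '-', $base), '-');
    $slug = substr($slug, 0, 50);          // keep the name well under length limits
    if ($slug === '') {
        $slug = 'file';                    // nothing readable survived
    }

    $name = substr($uniqueHash, 0, 8) . '_' . $slug;
    return $ext !== '' ? $name . '.' . $ext : $name;
}
```

The original, unmodified filename would still be stored in the database, as described above.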
Buckets
You should definitely consider using buckets because OSes (and humans) don't like folders with thousands of files.
If you're using hashes you already have a convenient bucket scheme available.
If your hash is 0aa1ea9a5a04b78d4581dd6d17742627
Your bucket(s) can be: 0/a/a/1/e/a9a5a04b78d4581dd6d17742627. In this case there are 5 nested buckets, which means you can expect to have one file in each bucket after 16^5 (~1 million) files. How many levels of buckets you need is up to you.
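A minimal sketch of that bucket scheme, assuming one hex character per level. bucketPath and the $levels parameter are hypothetical names.

```php
<?php
// Hypothetical helper: turn a hash into nested single-character buckets.
function bucketPath(string $hash, int $levels = 5): string
{
    $buckets = str_split(substr($hash, 0, $levels));   // one character per level
    return implode('/', $buckets) . '/' . substr($hash, $levels);
}

// Usage:
// bucketPath('0aa1ea9a5a04b78d4581dd6d17742627');
```

The returned value is a relative path, so it can be stored in the database as-is.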
Mime-type
It's also good to keep track of the original file extension / mime-type. If you only have one kind of mime-type (like TIFF) then you don't need to worry about it. Most formats have a signature that makes them easy to detect, but you don't want to have to rely on that. PNG files, for example, carry the letters PNG right at the start (open one with a text editor to see it).
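For illustration, the PNG signature check mentioned above could look like this. looksLikePng is a hypothetical helper; PHP's fileinfo extension offers more general detection if it is available.

```php
<?php
// Hypothetical helper: true if the data begins with the 8-byte PNG signature.
function looksLikePng(string $bytes): bool
{
    return strncmp($bytes, "\x89PNG\r\n\x1a\n", 8) === 0;
}

// Usage: read only the first 8 bytes of a file (the path is an assumption).
// $isPng = looksLikePng(file_get_contents($path, false, null, 0, 8));
```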
Relative path vs absolute path
I would also recommend saving the relative path to the files, not the absolute path. This makes maintenance much easier.
So save:
0/a/a/1/e/a9a5a04b78d4581dd6d17742627
instead of:
/var/www/wwwdata/images/0/a/a/1/e/a9a5a04b78d4581dd6d17742627

Efficient Directory Structure

What would be the best directory structure for a massive number of files?
Consider that I have more than 20 million files using a numeric ID as the file name (e.g. 13842985.xml).
If I went with something like
filename : 13842985.xml
directory : 1/3/8/13842985.xml
How can I do this properly so that all files are spread evenly across the directories and subdirectories?
You could create the directory structure like a trie.
Change your method slightly to this instead:
filename : 13842985.xml
directory : 842/985/13842985.xml # use the last 6 digits to create the directory name
I am assuming the filenames are somewhat random. This scheme will create 1000 top folders, each containing 1000 subfolders. By starting from the last digits instead of the first, you will be protected against long filenames:
filename : 138429851234.xml
directory : 851/234/138429851234.xml
Hope this helps!
Edit: By hashing the filename first and using this number instead, you'll avoid degenerate cases (for example, varying only in the beginning).
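A sketch of the last-six-digits scheme, assuming the file name is a numeric ID plus an extension. idPath is a hypothetical name, and left-padding short IDs with zeros is an added assumption.

```php
<?php
// Hypothetical helper: "13842985.xml" -> "842/985/13842985.xml".
function idPath(string $filename): string
{
    $id   = pathinfo($filename, PATHINFO_FILENAME);        // strip the extension
    $tail = substr(str_pad($id, 6, '0', STR_PAD_LEFT), -6); // last six digits
    return substr($tail, 0, 3) . '/' . substr($tail, 3) . '/' . $filename;
}
```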
Do some benchmarking to figure out the point at which scanning through multiple directories becomes cheaper than scanning through "many" files in a single directory.
At some point the file system overhead of opening/scanning/security-checking/etc... on each directory layer you add will be higher than the savings from not having to parse one large directory to find the single file you want. That's the level at which you'd stop splitting/layering.
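A rough benchmarking sketch along those lines. timeLookups is a hypothetical helper; the idea is to call it once with paths into a flat directory and once with the same files in a split layout, then compare the timings. The operating system caches directory data, so repeated runs will differ from a cold start.

```php
<?php
// Hypothetical benchmark helper: seconds spent on file_exists() for the given paths.
function timeLookups(array $paths): float
{
    $start = microtime(true);
    foreach ($paths as $path) {
        clearstatcache(true, $path);   // avoid measuring PHP's own stat cache
        file_exists($path);
    }
    return microtime(true) - $start;
}

// Usage ($flatPaths and $splitPaths are assumed to be prepared lists of paths):
// printf("flat: %.4fs, split: %.4fs\n", timeLookups($flatPaths), timeLookups($splitPaths));
```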

Structurizing files without db

Basically I have a simple form that users use to upload files. Files should be stored under the /files/ directory, with some subdirectories so that the files are split almost equally, e.g. /files/sub1/sub2/file1.txt
Also, I must not store duplicate files (by filename).
I have my own solution: calculate the sha1 of the filename, take the first 5 symbols (abcde, for example) and put the file in /files/a/b/c/d/e/. This works well, but leads to situations where one folder contains 4k files and another 6k. Is there any way to make the file counts closer to each other? The maximum number of files can be anywhere from 10k to 10 million.
Thanks for any help.
P.S. Maybe I explained something wrong, so once again :) The task is simple: you have only HTML and PHP (without any database) and a files directory where you should store only the uploaded files, without any data of your own. You should develop a script that stores uploads in the files directory without storing duplicates (by filename) and splits the uploaded files into subdirectories by file count, so that the number of files in each directory is as close as possible to the others.
I have no idea why you want it that way. But if you REALLY have to do it this way, I would suggest you set a limit on how many bytes are stored in each folder. Every time you have to save data, you open a log containing
the current subdirectory
the total number of bytes written to that directory
If necessary, you create a new subdirectory (you could use the current timestamp because it won't repeat) and reset the byte count.
Then you save the file and increment the byte count by the number of bytes written.
I highly doubt it is worth the work, but I do not really know why you want to distribute the files that way.
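The bookkeeping described above could be sketched as a pure function. The state array stands in for the log file, and assignDirectory with its parameters is a hypothetical name; reading and writing the log, and creating the directory, are left out.

```php
<?php
// Hypothetical state, as it would be read from the log: ['dir' => string, 'bytes' => int].
function assignDirectory(array $state, int $fileBytes, int $limitBytes, string $freshDirName): array
{
    if ($state['bytes'] + $fileBytes > $limitBytes) {
        // The current directory would overflow: switch to a new one and reset the counter.
        $state = ['dir' => $freshDirName, 'bytes' => 0];
    }
    $state['bytes'] += $fileBytes;   // account for the file about to be saved
    return $state;
}

// Usage (the timestamp-based directory name follows the suggestion above):
// $state = assignDirectory($state, filesize($tmpFile), 50 * 1024 * 1024, (string) time());
```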

List image files using PHP -- and be case-sensitive

A drop-box directory for image files has collected variants by letter-case, for example:
Bonsai.jpg, BONSAI.jpg, Bonsai.JPG, bonsai.jpg
I am making a web app using CodeIgniter to manage these documents on a remote server. This means using
file_exists() or is_file() to verify a file's presence
HTML img tag to display the file graphically
But both these tools use the first match they find, regardless of case. How can I deal with this?
(I noticed a similar question to this one, but for Delphi instead of PHP.)
But both these tools use the first match they find, regardless of case
They definitely shouldn't - at least not on a file system that is case sensitive, like Linux's default file system (is it still called ext2?). While it's questionable practice to have those four files in the same directory IMO, neither file_exists() nor the serving of web resources should show the behaviour you describe.
It's different on Windows: FAT and NTFS are not case sensitive. In your example, only one of the four files you mention can exist in the same directory.
When accepting images I always rename them, for example using CI's encrypt filenames option of the File Upload class, to avoid these kinds of problems. Otherwise it can turn into a big headache.
EDIT: added my comment on the OP below
You can easily write a script that puts all filenames into an array, identifies duplicates and appends _1 to their name. Now you have only unique filenames. Then you convert them all to lowercase. For all existing files and new ones you encrypt the filenames to a 32-character string. Batch processing of filenames like this is actually quite easy. Just keep a backup of all files just in case, and very little can go wrong.
CodeIgniter has some useful functions like the file helper's get_filenames(), which puts all files in a specified directory into an array, and the security helper's dohash(), which would encrypt the filenames. For future uploads, set the encrypt_name preference to TRUE
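The first step of that script, finding names that differ only by letter case, could be sketched in plain PHP like this. caseVariantGroups is a hypothetical helper and does not use any CodeIgniter function; the list of names could come from scandir() or get_filenames().

```php
<?php
// Hypothetical helper: group names that collide when case is ignored.
function caseVariantGroups(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        $groups[strtolower($name)][] = $name;   // key on the lowercased name
    }
    // Keep only the groups that actually contain more than one variant.
    return array_filter($groups, function (array $group): bool {
        return count($group) > 1;
    });
}
```

Each returned group lists the variants that would need renaming before the names are lowercased.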
