Retrieving the proper file from a directory with ~280,000 files - php

I have ~280,000 files that will need to be searched through, and the proper file returned and opened. The file names are exact matches of the expected search terms.
The search terms will be taken by an input box using PHP. What is the best way to accomplish this so that searches do not take a large amount of time?
Thanks!

I suspect the file system itself will struggle with 280,000 files in one directory.
An approach I've taken in the past is to put those files in subdirectories based upon the initial characters of the filename, e.g.
1/100000.txt
1/100001.txt
...
9/900000.txt
etc. You can subdivide further using the second character, and so on.
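Since the filenames are exact matches of the search terms, you don't even need to scan the directory: you can build the bucket path straight from the term. A minimal sketch, assuming a files/ root, a .txt extension, and a one-level split (all of which you'd adapt to your layout):

// map a search term straight to its bucketed path instead of scanning the directory
$term   = basename($_POST['term']);          // strip any directory components from user input
$bucket = strtolower(substr($term, 0, 1));   // first character, e.g. "f"
$path   = "files/$bucket/$term.txt";         // assumed layout and extension
if (is_file($path)) {
    readfile($path);
}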

It's good you added mysql to your tags. Ideally I would have a cron task that indexes the directories into a MySQL table and then use that table to do the actual search; an indexed database lookup is far faster than iterating the file system. You could run the task daily or hourly depending on how often your files change, or use something like Guard to monitor the file system for changes and make the appropriate updates.
See: https://github.com/guard/guard
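A rough sketch of such an indexer, run from cron (the table name, columns, and connection details are assumptions):

// index every file under the storage root into MySQL,
// assuming a table files(name VARCHAR(255) PRIMARY KEY, path VARCHAR(1024))
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('REPLACE INTO files (name, path) VALUES (?, ?)');
$iter = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('/path/to/files'));
foreach ($iter as $file) {
    if ($file->isFile()) {
        $stmt->execute([$file->getFilename(), $file->getPathname()]);
    }
}

A search is then a single indexed query along the lines of SELECT path FROM files WHERE name = ?.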

Related

Using glob() to fetch filenames without "v[0-9]"

I know how to use glob() to fetch all image files in a directory, but I want to save retrieval time and only fetch the ones I need in the first place.
I am building a car dealership website, and there is a directory where all the vehicle photos get stored. Photos that are associated with a vehicle for sale start with the letter "v" and then the database ID, and then a dot before the model of vehicle.
Here is a sample list of files in a directory:
v313.2014.toyota.camry.0.jpg
v313.2014.toyota.camry.1.jpg
fordfusion.jpg
fordfusion2.jpg
v87.2015.honda.civic.0.jpg
v87.2015.honda.civic.1.jpg
2014.ford.escape.0.jpg
2014.ford.escape.1.jpg
Out of those files, only fordfusion.jpg, fordfusion2.jpg, 2014.ford.escape.0.jpg, 2014.ford.escape.1.jpg should be returned by glob().
I hope this is possible without retrieving all the image files and then going through the array with a regex because 90% of the images being fetched wouldn't be necessary.
Unless there is an extremely large number of files in the directory, this isn't worth worrying about. glob() internally has to iterate through all files in the folder to check their names against the pattern anyway; doing it in PHP code with a regular expression will perform equally well.
If there really is a very large number of files in the directory… don't do that. Large directories perform very poorly in general, and many filesystems have limits or pathological behavior with huge directories. (For instance, the ext3 file system, common on older Linux systems, allows only about 32,000 subdirectories per directory and slows down noticeably once a directory holds a very large number of files.) Split them up into multiple directories.
To answer the question directly, though, there is no way to do this with a glob() pattern. It's possible to match all the files that do have names starting that way, but there's no way to invert the match. (You could check for [^v]* and v[^0-9]* as two separate patterns, but there's no way to combine them into a single pattern.)
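For what it's worth, the two-pattern workaround would look something like this in PHP (the photos/ directory is an assumption; [!...] is the portable negation syntax for glob character classes):

// names not starting with "v", plus names starting with "v" not followed by a digit
$files = array_merge(
    glob('photos/[!v]*.jpg') ?: [],
    glob('photos/v[!0-9]*.jpg') ?: []
);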
If you're confident that the files you don't want to retrieve all start with the letter "v" followed by a digit, you can try filtering the filenames with the following regex:
^[^v]+$|^v[^\d].+
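Applied from PHP, that could look like this (directory name is illustrative). Note that the pattern also drops names that merely contain a "v" later on, e.g. "chevy.jpg"; a negative lookahead such as ^(?!v\d) avoids that:

$names = array_map('basename', glob('photos/*.jpg') ?: []);
$keep  = preg_grep('/^[^v]+$|^v[^\d].+/', $names);   // filenames to keep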

Scan directory tree efficiently by date

What's the most efficient way to grab a list of new files after a given date in php, or perhaps using a system call?
I have full control over how the files are stored as I receive them, so I thought maybe storing them in a folder structure like year/month/day/filename would be best, then all I have to do is scan for the directories greater than or equal to the date I want to retrieve using scandir and casting the directory name to int values. But I am not sure if I'm missing something that would make this easier/faster. I'm interested in the most efficient way of doing this as there will be a lot of files building up over time and I don't want to have to rescan old directories. Basically the directory structure should lend itself well to efficient manual filtering but I wanted to check to see if I'm missing something.
Simple example usage:
'2012/12/1' => test1.txt, test2.txt
'2012/12/2' => test3.txt, test4.txt
'2011/11/1' => test5.txt
'2011/11/2' => test6.txt
If I search for files on or after 2011/11/2, then I want everything except test5.txt to be returned.
Thanks in advance for any insight!
edit: the storing and actual processing of files are two separate processes, so I can't just process them as they come in, which would obviously be the best solution.
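For reference, a sketch of the directory-name comparison I'm describing (the archive/ root is just a placeholder):

// compare Y/m/d directory names numerically against a cutoff date and only
// descend into directories on or after that date
$cutoff = 20111102;                                    // 2011/11/2
$files  = [];
foreach (glob('archive/*/*/*', GLOB_ONLYDIR) as $dir) {
    [$y, $m, $d] = array_slice(explode('/', $dir), -3);
    if ((int)sprintf('%04d%02d%02d', $y, $m, $d) >= $cutoff) {
        $files = array_merge($files, glob("$dir/*"));
    }
}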
Generally speaking I create directories like YYYY/MM/DD to store my files, often with another level for different sources. Sometimes I'll use YYYY-MM/DD or something similar. Note that there are only 3652 days in a decade, so you could even have a single level like YYYY-MM-DD and not get directories that are so large that they're hard to work with. If you have a filesystem that indexes directories, you can easily have 10s of thousands of files in a directory, otherwise one thousand should probably be your upper limit.
To process the files, I don't bother doing any actual searching of directory names. Since I know what date I'm interested in, I can simply generate the paths and scan only the directories containing files in the proper date range.
For example, let's say I want to process all files for the past week:
// roughly, in PHP (processFile() stands for your own processing routine):
for ($d = 7; $d >= 0; $d--) {
    $path = date('Y/m/d', strtotime("-$d days"));
    foreach (glob("$path/*") as $filename) {
        processFile($path, basename($filename));
    }
}
It looks like you are on either Linux or Mac, based on how you wrote your path.
The find command can return a list of files modified (or accessed) within a certain date.
// find files that were modified less than 30 minutes ago;
// exec() captures every output line into $filelist (system() would only return the last line)
exec("find /path/to/files -type f -mmin -30", $filelist);
I think system calls should be used sparingly since they reduce portability.
Storing in directories as you mentioned makes sense as it will reduce the search space.

Fast access to files

I'm currently building an application that will generate a large number of images (a few tens of thousands of images, possibly more, but not in the near future at least). I want to be able to determine whether a file exists or not and also send it to clients over HTTP (I'm using Apache as my web server).
What is the best way to do this? I thought about splitting the images across a few folders to reduce the number of files in each directory. For example, let's say that I decide that each file name will begin with a lowercase letter of the alphabet. Then I create 26 directories, and when I want to look for a file I add the name of the directory first. For example, if I want a file called "funnyimage2.jpg" I will save it inside a directory called "f". I can add layers to that structure if that is required.
To be honest I'm not even sure if just saving all the files in one directory isn't just as good, so if you could add an explanation as to why your solution is better it would be very helpful.
P.S. My application is written in PHP and I intend to use file_exists to check whether a file exists or not.
Do it with a hash, such as md5 or sha1 and then use 2 characters for each segment of the path. If you go 4 levels deep you'll always be good:
f4/a7/b4/66/funnyimage.jpg
Oh, and the reason it's slow to dump everything into one directory is that many filesystems don't store filenames in a B-tree or similar structure, so finding a file often means scanning the entire directory.
The reason a hash is great, is because it has really good distribution. 26 directories may not cut it, especially if lots of images have a filename like "image0001.jpg"
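A minimal sketch of that layout in PHP, hashing the original filename (any stable key would do; the images/ root is an assumption):

// build a 4-level path from the first 8 hex characters of an md5 hash
$name = 'funnyimage.jpg';
$hash = md5($name);                                        // 32 hex characters
$dir  = implode('/', str_split(substr($hash, 0, 8), 2));   // e.g. "f4/a7/b4/66"
$path = "images/$dir/$name";                               // assumed root directory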
Quoting the Wikipedia article on ext3: since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Consequently, ext3 lacks recent features such as extents, dynamic allocation of inodes, and block suballocation. A directory can have at most 31,998 subdirectories, because an inode can have at most 32,000 links.
A directory on a unix file system is just a file that lists filenames and what inode contains the actual file data. As such, scanning a directory for a particular filename boils down to the equivalent operation of opening a text file and scanning for a line with a particular piece of text.
At some point, the overhead of opening that directory "file" and scanning for your filename will outweigh the overhead of using multiple sub-directories. Generally, this won't happen until there are many thousands of files. You should benchmark your system/server to find where the crossover point is.
After that, it's a simple matter of deciding how to split your filenames into subdirectories. If you're allowing only alpha-numeric characters, then maybe a split based on the first 2 characters (1,296 possible subdirs) might make more sense than a single dir with 10,000 files.
Of course, for every additional level of splitting you add, you're forcing the system to open yet another directory "file" and scan for your filename, so don't go too deep on the splits.
Your setup is okay. Keep going this way.
It seems that you are on the right path. Another post at ServerFault seems to confirm that you are doing the right thing.
I think Linux has a limit on the number of files a directory can contain; it might be best to split them up.
With your method, you can have the same exact image under many different file names. Also, you'll have more images that start with "t" than with "q", so some directories would still get large. You might want to store them as MD5-HASH.jpg instead. This will eliminate duplicates and give a much more even distribution across the first-character directories (16 of them for a hex hash, more if you split on two characters).
Edit: Like Evert mentions, you can do a multi-level directory structure to keep the directory size even smaller.

Keeping track of links or references to image files and deleting unused ones (PHP/Database)

I need a way to remove "unused" images from my filesystem, i.e. images that are never accessed from any point in my website (doesn't matter if I break external links. I might disable external hotlinking altogether). What's the best way of going about this? Regular users can add multiple attachments to topics/posts and content contributers can bulk upload large numbers of images which can be used in articles or image galleries.
The problem is that the images could be referenced in any of the following ways:
From user content (text/html, possibly Markdown or BBCode) stored in the database
Hardcoded into an HTML page
Hardcoded into a PHP file
Hardcoded into a CSS file
As an "attachment" field in a database table, usually containing only the filename itself with no path, because the application assumes that it would be in a certain folder.
And to top it off, the path of the image could be an absolute or relative HTTP or PHP path and may or may not be built with string concatenation in PHP.
So obviously find/replace or regexing the database or filesystem is out of the question. But luckily for you and me, this system isn't fully implemented yet and I don't need anything that deals with an existing hoard of images. I just need to set up some efficient structure that will allow this in the future.
Some ideas I've thought of:
Intercepting the HTTP request for the image with PHP, and keeping track of the HTTP_REFERER. The problem with this is that just because no one has clicked on a link at the time of checking this doesn't mean the link doesn't exist.
Use extreme database normalization - i.e. make a table for images and use foreign keys for anything that references it. However this would result in making a metric craptonne of many-to-many relationships (and the crosstables) in addition to being impractical for any regular user to use.
Backup all the images and delete them, and check every single 404 request and run a script each time that attempts to find the image from the backup folder and puts it in the "real" folder. The problem is that this cache would have to be purged every so often and the server might be strained when rebuilding the cache.
Ideas/suggestions? Is this just something you have to ignore and live with even if you're making a site with a ridiculous number of images? Even if it's not worth it, how would something like this work, just as a proof of concept? (I added the garbage-collection tag because this might be going into that area conceptually.)
I will admit that my experience with this was simpler than yours. I had no 'user generated content' so to speak, and my images were all referenced only in templates or in the database with a full path. But what I did was create a Perl script that:
Analyzed my HTML templates, database table, and CSS and generated a list of files. In the HTML it looked for <img> tags, in the CSS it looked for any .png, .jp*g, or .gif strings, and the tables were easy because I had an Image table for the image data.
Sorted the file list to remove duplicates.
Iterated through the list and wrote a CSV like filename,(CSS filename|HTML filename|DBTABLE),(exists|notexists) for auditing.
In another iteration, renamed all files not in the list by appending .del to the filename.
After regression testing, I called the script with a -docleanup flag which told it to go through and delete all the .del-appended files.
If for whatever reason an image was tagged as .del and shouldn't have been, I just manually renamed it back to its original form.
A couple of notes: I realize that I could have made this script 'smoother' and done multiple things in multiple steps, but its use grew over time and I wanted clearly delineated processing steps so it couldn't ever run amok. I used the CSV to go back and clean up the information where the image didn't exist.
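The same idea in PHP might look roughly like this; the paths, extensions, and the reference-matching regex are all assumptions to adapt, and GLOB_BRACE is not available on every platform:

// collect image filenames referenced in templates/CSS, then flag unreferenced files on disk
$referenced = [];
foreach (glob('templates/*.{html,css,php}', GLOB_BRACE) ?: [] as $tpl) {
    preg_match_all('/[\w.\/-]+\.(?:png|jpe?g|gif)/i', file_get_contents($tpl), $m);
    foreach ($m[0] as $ref) {
        $referenced[basename($ref)] = true;
    }
}
foreach (glob('images/*.{png,jpg,jpeg,gif}', GLOB_BRACE) ?: [] as $img) {
    if (!isset($referenced[basename($img)])) {
        echo "unreferenced: $img\n";   // candidate for renaming to .del after review
    }
}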

How to organize files created dynamically using PHP?

I have a PHP website which creates and stores HTML template files on the server based on user input. One user can create many templates. To store the template files and associate them with the DB record, what I do is:
"templates" is the table which holds other information about each template, such as who created it, with a unique auto-increment id as template_id.
For example:
If the template id is 1001, I convert it to hex, which is 03e9.
Now I split the hex number after two digits into 03 & e9: 03 becomes the folder and e9 becomes the file, with some extension, as "e9.tpl".
This is how I can find the template on the file system if I know the template ID. I don't need to separately store the path to the file.
Is it a good approach? Are there any shortfalls of this approach? Is there any other approach better than this?
What are the advantages/disadvantages of storing the path to the file in the database itself, for example to enable serving templates from different disks etc.?
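For reference, the mapping I describe above in PHP (pad length and extension as in the example):

// 1001 -> "03e9" -> "03/e9.tpl"
$id   = 1001;
$hex  = str_pad(dechex($id), 4, '0', STR_PAD_LEFT);             // "03e9"
$path = substr($hex, 0, 2) . '/' . substr($hex, 2) . '.tpl';    // "03/e9.tpl"

Note that once IDs pass 65,535 the hex string grows to five digits, so the split needs a rule for that case (one of the answers below points this out).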
If the ID in the DB table is already UNIQUE, why transform the id for the filesystem at all?
Just add a file 1001.tpl and you are all set. If you want to have template files sorted into folders, use the User ID (which I assume to be UNIQUE too), so you get folder 124/1001.tpl.
Depending on your deployment process, you will want to keep the created files outside the application folder, so you don't accidentally delete them when updating the application.
Are you doing this because you are worried that you might run out of file entries/inodes in the directory? In ext3 the practical limit is somewhere around 100,000 files (and 32,000 dirs).
Creating a directory structure on the fly is better done using modulo, as in $dir = $id % 1000, and then putting the new template in that dir ($dir/$id.tpl). That strategy will create at most 1000 dirs, and you have thus made it possible to handle around 100,000,000 files.
I don't see any reason for messing with hexadecimal values or substrings.
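A quick sketch of that modulo bucketing (the templates/ root, $id, and $templateHtml are placeholders):

// at most 1000 directories, created on demand
$dir = $id % 1000;
if (!is_dir("templates/$dir")) {
    mkdir("templates/$dir", 0755, true);
}
file_put_contents("templates/$dir/$id.tpl", $templateHtml);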
If you have to hit the database to get the id, you may be just as well off storing the template in it as well. But there's nothing categorically wrong with storing them on the file system. I generally would.
When you hit 65,536, you'll get 0x10000. Make sure your code can handle that. I'd be more apt to store 0x1234 like: 1/1234.tpl, just for the sake of clarity. Note that by virtue of sequential IDs, your folders will fill up sequentially.
I'd probably not even convert them to hex. You could use a modulus operator to determine which folder to put them in. Figure out how many files you are likely to have and use that to determine how many folders you want.
For example:
$path = ($id % NUMBER_OF_FOLDERS) . "/$id.tpl";
where $id is the template id in decimal.
I don't understand the point of separating the hex into two parts to create different folders... That could create hundreds and hundreds of different folders, which would become a complete mess on your server. Why not just store all the templates in one single folder with the hex value as the file name, such as 03e9.tpl?
