Basically, I have a simple form that users use to upload files. Files should be stored under the /files/ directory, with subdirectories that split the files almost equally, e.g. /files/sub1/sub2/file1.txt
I also need to avoid storing duplicate files (by filename).
I have my own solution: calculate the sha1 of the filename, take the first 5 characters (abcde, for example) and put the file in /files/a/b/c/d/e/. This works well, but it leads to situations where one folder contains 4k files and another 6k. Is there any way to make the file counts closer to each other? The maximum file count can be 10k or 10kk.
Thanks for any help.
P.S. Maybe I explained something wrong, so once again :) The task is simple: you have only HTML and PHP (without any DB) and a files directory where you should store only the uploaded files, without any data of your own. You should develop a script that stores uploads in the files directory without storing duplicates (by filename) and splits the uploaded files across subdirectories by file count (the number of files in each directory should be close to each other).
I have no idea why you want it that way. But if you REALLY have to do it this way, I would suggest you set a limit on how many bytes are stored in each folder. Every time you have to save data, you open a log containing:
the current subdirectory
the total number of bytes written to that directory
If necessary, you create a new subdirectory (you could use the current timestamp because it won't repeat) and reset the byte count.
Then you save the file and increment the byte count by the number of bytes written.
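The steps above could be sketched roughly like this. It is only an illustration under my own assumptions: the function name, the JSON log file current.log and the sequence number appended to the timestamp (so that two rollovers within the same second cannot produce the same directory name) are not from the answer, and there is no file locking for concurrent uploads.

```php
<?php
// Sketch of the byte-limit idea. storeWithByteLimit() and current.log are
// hypothetical names chosen for this example.

/**
 * Store $data as $name under $baseDir, starting a new subdirectory whenever
 * the current one would exceed $limitBytes. Returns the path written to.
 */
function storeWithByteLimit(string $baseDir, string $name, string $data, int $limitBytes): string
{
    $logFile = $baseDir . '/current.log';

    // Read the log: current subdirectory, bytes written to it, rollover counter.
    $state = ['sub' => null, 'bytes' => 0, 'seq' => 0];
    if (is_file($logFile)) {
        $state = json_decode(file_get_contents($logFile), true);
    }

    $size = strlen($data);

    // Create a new subdirectory if there is none yet or the limit would be exceeded.
    if ($state['sub'] === null || $state['bytes'] + $size > $limitBytes) {
        $state['seq']++;
        $state['sub'] = time() . '-' . $state['seq'];
        $state['bytes'] = 0;
        mkdir($baseDir . '/' . $state['sub'], 0775, true);
    }

    // Save the file and increment the byte count by the bytes written.
    $path = $baseDir . '/' . $state['sub'] . '/' . $name;
    file_put_contents($path, $data);
    $state['bytes'] += $size;
    file_put_contents($logFile, json_encode($state));

    return $path;
}
```

With a limit of 10 bytes, for instance, two 6-byte files end up in different subdirectories, while a following 3-byte file joins the second one.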
I highly doubt it is worth the work, but I do not really know why you want to distribute the files that way.
This is a completely theoretical question.
I have a photo storage site in which photos are uploaded by users registered in the website.
The Question
Which of the approaches is faster?
And which is better in the long term, when I need to use a lot of computers and hard disks?
Is there any other approach that's even better?
So far I have thought of two approaches to accomplish this.
The number of files uploaded to my server is expected to be huge (roughly 100 million or more).
Approach 1
These two /pictures/hd/ & /pictures/low/ directories will contain all the files uploaded by the user.
$newfilename = $user_id.time().$filename; //$filename = actual filename of uploaded file
$src = '/pictures/hd/'.$newfilename; //for hd pics
Inserting that into mysql by
insert into pics(`user_id`,`src`)VALUES('$user_id','$newfilename')
Approach 2
These two /pictures/hd/ & /pictures/low/ directories will contain sub-directories of the files uploaded by the user.
This is going to create lots of subdirectories with the name as user_id of the user who is uploading the file into the server.
if (!is_dir('/pictures/hd/'.$user_id.'/')) {
    mkdir('/pictures/hd/'.$user_id.'/');
}
$newfilename = $user_id.'/'.$user_id.time().$filename; //$filename = actual filename of uploaded file
$src = '/pictures/hd/'.$newfilename; //for hd pics
Inserting that into mysql by
insert into pics(`user_id`,`src`)VALUES('$user_id','$newfilename')
Retrieval
When retrieving an image, I can use the src column of my pics table to get the filename, and then load the HD file from '/pictures/hd/'.$src_of_picstable and the low-quality file from '/pictures/low/'.$src_of_picstable
The right way to answer the question is to test it.
Which is faster will depend on the number of files and the underlying filesystem; ext3 and ext4 will quite happily cope with very large numbers of files in a single directory (dentries are managed in an HTree index). Some filesystems just use simple lists. Others have different ways of optimizing file access.
Your first scaling problem will be how to manage the file set across multiple disks. Just extending a single filesystem across lots of disks is a bad idea. If you have lots of directories, then you can have lots of mount points. But this doesn't work all that well when you get to terabytes of data.
However, because the content is indexed independently of the file storage, it doesn't matter much what you choose now for your file storage: you can easily change the mapping of files to locations later without having to move your existing dataset around.
I wouldn't suggest the single-directory approach, for two reasons. Firstly, if you're planning to have a lot of images, your directory will get really big, and searching for a single image manually will take a lot longer. You will need to do that when you debug something or test new features.
The second reason for multiple directories is that you can make smaller backups of parts of your gallery. And if you have a really big gallery (let's say several terabytes), a single hard drive might not be enough to contain it all. With multiple directories you can mount each directory on a separate hard drive and this way handle an almost infinitely large gallery.
My favorite approach is a YYYY/MM/type-of-image directory structure. This way you can spot when you introduced some bug by looking month by month. You can also make monthly backups without duplicating redundant files, and take quarterly snapshots of the whole gallery just in case.
As for type-of-image, there are several types of images that I might need, such as the original image, small thumbnail, thumbnail, normal image, etc. This way I can just swap the type of image in the path and get a different image size.
For you I would suggest a YYYY/MM/type-of-image/user_id approach, where you could easily find all of a user's uploaded files in one place.
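A minimal sketch of building such a path, assuming a hypothetical helper datedImagePath() and made-up type names like "thumbnail" (neither is from the answer):

```php
<?php
// Sketch only: datedImagePath() and its parameters are assumptions.

/**
 * Build "<root>/YYYY/MM/<type>/<user_id>/<file>" for a given upload time.
 * $timestamp defaults to the current time.
 */
function datedImagePath(string $root, string $type, int $userId, string $filename, ?int $timestamp = null): string
{
    $timestamp = $timestamp ?? time();

    return sprintf(
        '%s/%s/%s/%d/%s',
        rtrim($root, '/'),
        date('Y/m', $timestamp),   // e.g. 2024/03
        $type,                     // e.g. original, thumbnail
        $userId,
        basename($filename)        // drop any directory part of the client name
    );
}
```

Swapping only the $type argument then gives the location of a different size of the same image.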
I am developing a social system where users can upload images.
I am not very sure how I should structure my files. The two ideas I have are the following:
Add a folder with the userID as name and put all the users images in there
Have a single folder for all images, and give each image a unique name
Which one of the two would be better? Is it the proper way to organize images?
It depends.
When you have a gallery with images stored in a directory per user, you also have to worry about duplicates (say, a user uploads several files with the same name).
On the other hand, with a single folder for all images you might run into performance issues related to the number of files in a single directory (if you have more than 10k files it might lag).
The solution I like is to create a unique name and split it into several parts to create a directory structure.
For example:
Generate a unique name for the file image.jpg (look into the manual), say
nfsr53a5gb
Add the original extension to it, so it becomes nfsr53a5gb.jpg
Split it with / into nf/sr/53/a5/gb.jpg
Create missing folders as you go (see the recursive parameter)
You won't hit the penalty for the number of files in a directory any time soon, you won't get collisions, and the files have URLs that are difficult to guess.
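The steps above can be sketched as follows. This is only one way to do it: I am assuming random_bytes() for the unique name (the answer does not say which function it had in mind), so the name is hexadecimal, and the function names are made up for the example.

```php
<?php
// Sketch only: function names and the default name length are assumptions.

/** Build a relative path such as "3f/a9/0c/5d/e1.jpg" for an uploaded file. */
function uniqueImagePath(string $originalName, int $length = 10): string
{
    // random_bytes($length) gives 2 * $length hex characters; keep $length of them.
    $unique = substr(bin2hex(random_bytes($length)), 0, $length);
    $ext = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    // Two-character segments joined by "/": the last segment becomes the file name.
    $path = implode('/', str_split($unique, 2));

    return $ext === '' ? $path : $path . '.' . $ext;
}

/** Create any missing folders for $target (mkdir's third, "recursive" argument). */
function ensureDirectoryFor(string $target): void
{
    $dir = dirname($target);
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Could not create directory: $dir");
    }
}
```

In an upload handler you would call ensureDirectoryFor() on the full target path and then move_uploaded_file() into it. A random name makes collisions very unlikely rather than impossible, so checking file_exists() on the target first is cheap insurance.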
A nice touch is to add a controller for getting those files which will change the name to the original name (store it in database and switch headers). Use it only for a dedicated download button as this might be too CPU and I/O intensive for many images embedded on a page.
In order to do this you have to modify headers like this:
Content-Disposition: attachment; filename=YOUR_FILE_NAME.YOUR_EXTENSION
Content-Type: application/octet-stream
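A rough sketch of such a controller, assuming the original name has already been looked up in the database. The function names are hypothetical, and stripping quotes, slashes and line breaks from the name is my own precaution, not part of the answer:

```php
<?php
// Sketch only: downloadHeaders() and sendDownload() are made-up names.

/** Build the two headers shown above for a given original file name. */
function downloadHeaders(string $originalName): array
{
    // Remove characters that could break the quoted value or inject extra headers.
    $safe = str_replace(['"', "\r", "\n", '\\', '/'], '', $originalName);

    return [
        'Content-Type: application/octet-stream',
        'Content-Disposition: attachment; filename="' . $safe . '"',
    ];
}

/** Send the stored file to the browser under its original name. */
function sendDownload(string $storedPath, string $originalName): void
{
    foreach (downloadHeaders($originalName) as $line) {
        header($line);
    }
    header('Content-Length: ' . filesize($storedPath));
    readfile($storedPath);
}
```

A dedicated download script would call sendDownload() with the hashed path on disk and the name stored in the database.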
Go for option #2; otherwise, as the number of users increases, the number of folders will keep increasing.
This method also makes fetching images in your code simpler.
I am building a site that is looking at Millions of photos being uploaded easily (with 3 thumbnails each for each image uploaded) and I need to find the best method for storing all these images.
I've searched and found examples of images stored as hashes.... for example...
If I upload, coolparty.jpg, my script would convert it to an Md5 hash resulting in..
dcehwd8y4fcf42wduasdha.jpg
and that's stored in /dc/eh/wd/dcehwd8y4fcf42wduasdha.jpg
but for the 3 thumbnails I don't know how to store them
QUESTIONS..
Is this the correct way to store these images?
How would I store thumbnails?
In PHP what is example code for storing these images using the method above?
Here is how I'm using the folder structure.
I upload the photo and move it like you said:
// IMAGES_PATH is a constant stored in my global config
define('IMAGES_PATH', '/path/to/my/images/');
$image = md5_file($_FILES['image']['tmp_name']);
// you can add a random number to the file name just to make sure your images will be "unique"
$image = md5(mt_rand().$image);
// e.g. coolparty = f3d40fc20a86e4bf8ab717a6166a02d4, which gives the folder f/3/d/
$folder = IMAGES_PATH.$image[0]."/".$image[1]."/".$image[2]."/";
// full-size image
$path = $folder.$image.'.jpg';
// thumbnail, I just prepend t_ to the image name
$thumbPath = $folder.'t_'.$image.'.jpg';
// make sure you create the folders with mkdir() before you move the files
// then move_uploaded_file() to $path, and write the thumbnail to $thumbPath after processing
I do believe this is the basic way; of course you can change the folder structure to a deeper one, like you said, with 2 characters per level, if you will have millions of images.
The reason you would use a method like that is simply to reduce the total number of files per directory (inodes).
Using the method you have described (3 levels deep), you are very unlikely to reach even hundreds of images per directory, since you will have a maximum of almost 17 million directories (16**6 = 16,777,216).
As for your questions:
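The arithmetic can be checked with a couple of lines. This is just an illustration: the function names are made up, and it assumes a perfectly even hash distribution over hexadecimal characters.

```php
<?php
// Sketch only: leafDirectories() and averageFilesPerDirectory() are hypothetical helpers.

/** Number of leaf directories for $levels levels of $charsPerLevel hex characters each. */
function leafDirectories(int $levels, int $charsPerLevel, int $alphabet = 16): int
{
    return $alphabet ** ($levels * $charsPerLevel);
}

/** Average number of files per leaf directory, assuming an even spread. */
function averageFilesPerDirectory(int $files, int $levels, int $charsPerLevel): float
{
    return $files / leafDirectories($levels, $charsPerLevel);
}
```

For /dc/eh/wd/ (3 levels of 2 characters) leafDirectories(3, 2) is 16,777,216, so even 10 million files average less than one file per directory. With one character per level, 3 levels give only 4,096 directories.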
Yeah, that is a fine way to store them.
The way I would do it would be
/aa/bb/cc/aabbccdddddddddddddd_thumb.jpg
/aa/bb/cc/aabbccdddddddddddddd_large.jpg
/aa/bb/cc/aabbccdddddddddddddd_full.jpg
or similar
There are plenty of examples on the net as far as how to actually store images. Do you have a more specific question?
If you're talking millions of photos, I would suggest you farm these off to a third party such as Amazon Web Services, more specifically Amazon S3. There is no limit on the number of files and, assuming you don't need to actually list the files, there is no need to separate them into directories at all (and if you do need to list them, you can use different delimiters and prefixes - http://docs.amazonwebservices.com/AmazonS3/latest/dev/ListingKeysHierarchy.html). Your hosting/retrieval costs will probably be lower than doing it yourself, and the files get backed up.
To answer more specifically: yes, split by subdirectories; using your structure, you can drop the first 5 characters of the filename, as you already have them in the directory name.
As for thumbs, as suggested by aquinas, just append _thumb1 etc. to the filename, or store them in separate folders.
1) That's something only you can answer. Generally, I prefer to store the images in the database so you can have ONE consistent backup, but YMMV.
2) How? How about /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb1.jpg, /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb2.jpg and /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb3.jpg
3) ??? Are you asking how to write a file to the file system or...?
For millions of images, yes, it is correct that using a database will slow down the process.
The best option is either to use the server file system to store the images, with .htaccess to add security,
or to use a web service. Many providers offer image APIs for uploading and displaying images.
You can go with that option too, for example Amazon.
I'm currently building an application that will generate a large number of images (a few tens of thousands of images, possibly more, but not in the near future at least). I want to be able to determine whether a file exists or not and also send it to clients over HTTP (I'm using Apache as my web server).
What is the best way to do this? I thought about splitting the images into a few folders to reduce the number of files in each directory. For example, let's say I decide that each file name will begin with a lowercase letter of the alphabet. Then I create 26 directories, and when I want to look for a file I add the name of the directory first. For example, if I want a file called "funnyimage2.jpg", I will save it inside a directory called "f". I can add layers to that structure if required.
To be honest I'm not even sure if just saving all the files in one directory isn't just as good, so if you could add an explanation as to why your solution is better it would be very helpful.
p.s
My application is written in PHP and I intend to use file_exists to check if a file exists or not.
Do it with a hash, such as md5 or sha1 and then use 2 characters for each segment of the path. If you go 4 levels deep you'll always be good:
f4/a7/b4/66/funnyimage.jpg
Oh, and the reason it's slow to dump everything in one directory is that most filesystems don't store filenames in a B-tree or similar structure. The filesystem will often have to scan the entire directory to find a file.
The reason a hash is great is that it has really good distribution. 26 directories may not cut it, especially if lots of images have a filename like "image0001.jpg"
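A minimal sketch of that layout, assuming md5 of the file name and hypothetical helper names. Because the path is derived from the name alone, checking whether a file exists stays a single file_exists() call:

```php
<?php
// Sketch only: hashedImagePath() and imageExists() are made-up names.

/** Map a file name to "<root>/f4/a7/b4/66/<name>" style storage (2 hex chars per level). */
function hashedImagePath(string $root, string $filename, int $levels = 4): string
{
    $name = basename($filename);
    // Take the first 2 * $levels characters of the hash and cut them into pairs.
    $segments = str_split(substr(md5($name), 0, $levels * 2), 2);

    return rtrim($root, '/') . '/' . implode('/', $segments) . '/' . $name;
}

/** Check for a stored image without scanning any directory. */
function imageExists(string $root, string $filename): bool
{
    return file_exists(hashedImagePath($root, $filename));
}
```

Names such as "image0001.jpg" and "image0002.jpg" hash to unrelated prefixes, which is what spreads them across directories.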
Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Consequently, ext3 lacks recent features, such as extents, dynamic allocation of inodes, and block suballocation.[15] A directory can have at most 31998 subdirectories, because an inode can have at most 32000 links.[16]
A directory on a unix file system is just a file that lists filenames and what inode contains the actual file data. As such, scanning a directory for a particular filename boils down to the equivalent operation of opening a text file and scanning for a line with a particular piece of text.
At some point, the overhead of opening that directory "file" and scanning for your filename will outweigh the overhead of using multiple sub-directories. Generally, this won't happen until there's many thousands of files. You should benchmark your system/server to find where the crossover point is.
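One rough way to run such a benchmark is sketched below. The function names are made up, the split on the first two characters of an md5 name is just one possible layout, and the numbers will be heavily influenced by OS caching, so treat the result as a relative indication only.

```php
<?php
// Sketch only: populate() and timeLookups() are hypothetical helpers.

/** Create $count empty files under $base, flat or split by the first 2 name characters. */
function populate(string $base, int $count, bool $split): array
{
    $paths = [];
    for ($i = 0; $i < $count; $i++) {
        $name = md5((string) $i) . '.jpg';
        $dir = $split ? $base . '/' . substr($name, 0, 2) : $base;
        if (!is_dir($dir)) {
            mkdir($dir, 0775, true);
        }
        touch($dir . '/' . $name);
        $paths[] = $dir . '/' . $name;
    }
    return $paths;
}

/** Time one file_exists() call per path and return the elapsed seconds. */
function timeLookups(array $paths): float
{
    clearstatcache();               // drop PHP's own cached stat results first
    $start = microtime(true);
    foreach ($paths as $path) {
        file_exists($path);
    }
    return microtime(true) - $start;
}
```

You would populate two directories with the same number of files (one flat, one split), shuffle the paths, and compare timeLookups() on each while increasing the count until a difference shows up.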
After that, it's a simple matter of deciding how to split your filenames into subdirectories. If you're allowing only alpha-numeric characters, then maybe a split based on the first 2 characters (1,296 possible subdirs) might make more sense than a single dir with 10,000 files.
Of course, for every additional level of splitting you add, you're forcing the system to open yet another directory "file" and scan for your filename, so don't go too deep on the splits.
Your setup is okay. Keep going this way.
It seems that you are on the right path. Another post at ServerFault seems to confirm that you are doing the right thing.
I think Linux has a limit on the number of files a directory can contain; it might be best to split them up.
With your method, you can have the exact same image under many different file names. Also, you'll have more images that start with "t" than with "q", so some directories would still get large. You might want to store them as MD5-HASH.jpg instead. This will eliminate duplicates and give a more even distribution across the directories (16 of them if you split on the first character of a hex MD5 digest).
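A small sketch of the MD5-HASH.jpg idea, assuming the hash is taken over the file contents with md5_file() so that identical images map to the same name. The function names and the one-character directory level are my own choices for the example:

```php
<?php
// Sketch only: contentAddressedPath() and storeOnce() are made-up names.

/** Name a file after the MD5 of its contents: "<root>/<first char>/<hash>.<ext>". */
function contentAddressedPath(string $root, string $sourceFile, string $extension): string
{
    $hash = md5_file($sourceFile);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $sourceFile");
    }
    return rtrim($root, '/') . '/' . $hash[0] . '/' . $hash . '.' . $extension;
}

/** Copy the file into place unless an identical one is already stored. */
function storeOnce(string $root, string $sourceFile, string $extension): string
{
    $target = contentAddressedPath($root, $sourceFile, $extension);
    if (!file_exists($target)) {
        $dir = dirname($target);
        if (!is_dir($dir)) {
            mkdir($dir, 0775, true);
        }
        copy($sourceFile, $target);
    }
    return $target;
}
```

Uploading the same picture twice under different names then results in a single stored file.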
Edit: Like Evert mentions, you can do a multi-level directory structure to keep the directory size even smaller.
I'm building a torrent site where users can upload torrents.
What would be a good way to save the .torrent files?
I can think of several options:
Saving the torrent file itself in a folder on the server (not the best option since OS's have limitations saving lots of files in 1 folder)
Saving the torrent file itself in different folders per month
Saving the contents of the torrent file in the database (any limitations / performance issues / any other caveats?)
Any other options?
If you're concerned about having too many files within one directory, you need to distribute the files across multiple directories. Storing them by month, week or day is one way to do so. Which one to pick depends on how many files you really have, I would say.
You can try to more or less equally distribute the files within subdirectories by hashing their filename and use the whole or part of the hash to generate one or multiple subdirectory names:
$hash = md5($fileName);
$storePath = sprintf('%s/%s', substr($hash,0,2), $fileName);
This would pick the first two characters of an md5 hash (00-ff, 256 subdirectories) to generate the subdirectory.
The benefit compared with a date is that you can always find out in which directory a file is stored when you have its name.
That also means that you cannot have duplicate files with the same name (which might have worked with date-based subfolders).
I use this:
Saving the torrent file itself in different folders per month
Using a database is not good at all. Just save them as static files, and maybe even gzip them.
Just make sure to rename them uniquely with some kind of hashing.
If you don't have any problem with using an external provider, you can use TorCache.
I would say saving the .torrent file in a weekly/monthly folder is the best option.
That way you can use the OS' filesystem cache, even if you store the .torrents outside the document root for limiting user access (in the end you will have to open() the file anyway)
Leaving torrents in the database would eventually lead to slow performance as the DB increases in size.
Maybe try Amazon S3? It's cheap, easy and fast.
Uploading them automatically saves .torrent files. http://www.tizag.com/phpT/fileupload.php has a good example. Give it a try.