Folder Structure for storing millions of images? - php

I am building a site that expects millions of photos to be uploaded (with 3 thumbnails for each uploaded image), and I need to find the best method for storing all these images.
I've searched and found examples of images stored under hashed names. For example:
If I upload coolparty.jpg, my script would convert it to an MD5 hash, resulting in
dcehwd8y4fcf42wduasdha.jpg
which is stored in /dc/eh/wd/dcehwd8y4fcf42wduasdha.jpg
but I don't know how to store the 3 thumbnails.
QUESTIONS
Is this the correct way to store these images?
How would I store thumbnails?
What would example PHP code for storing these images with the method above look like?

How I use the folder structure:
I upload the photo and move it like you said:
// IMAGES_PATH is a constant stored in my global config
define('IMAGES_PATH', '/path/to/my/images/');
$image = md5_file($_FILES['image']['tmp_name']);
// you can mix in a random number just to make sure your file names will be "unique"
$image = md5(mt_rand().$image);
// coolparty = f3d40fc20a86e4bf8ab717a6166a02d4, so the folder is f/3/d/
$folder = IMAGES_PATH.$image[0]."/".$image[1]."/".$image[2]."/";
$file = $folder.$image.'.jpg';
// thumbnail: I just prepend t_ to the image name
$thumb = $folder.'t_'.$image.'.jpg';
// make sure you create the folders with mkdir() before you move the files
if (!is_dir($folder)) {
    mkdir($folder, 0755, true);
}
move_uploaded_file($_FILES['image']['tmp_name'], $file);
// then generate the thumbnail from $file and save it as $thumb
I believe this is the basic way. Of course, you can change the folder structure to a deeper one, like you said, with 2 characters per level, if you will have millions of images.

The reason you would use a method like that is simply to reduce the total number of files per directory.
Using the method you have described (3 levels deep, 2 hex characters per level), you are very unlikely to reach even hundreds of images per directory, since you can have up to 16**6 = 16,777,216 (almost 17 million) leaf directories.
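The estimate above can be checked with a few lines of PHP. This is only a back-of-the-envelope sketch; the figure of 10 million images is an assumed workload for illustration and does not come from the question.

```php
<?php
// Each level uses two hex characters, so it has 16 * 16 = 256 possible names.
$dirsPerLevel = 16 ** 2;
$levels = 3;
$leafDirs = $dirsPerLevel ** $levels; // same as 16 ** 6

// Assumed workload (illustration only): 10 million originals,
// each stored as 1 original + 3 thumbnails.
$images = 10000000;
$filesPerImage = 4;
$avgFilesPerDir = ($images * $filesPerImage) / $leafDirs;

echo "leaf directories: ", number_format($leafDirs), "\n";
echo "average files per directory: ", round($avgFilesPerDir, 2), "\n";
```

With these assumed numbers the average stays in the low single digits, which suggests that two levels (65,536 leaf directories) may already be enough for many sites.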
As for your questions:
Yeah, that is a fine way to store them.
The way I would do it would be
/aa/bb/cc/aabbccdddddddddddddd_thumb.jpg
/aa/bb/cc/aabbccdddddddddddddd_large.jpg
/aa/bb/cc/aabbccdddddddddddddd_full.jpg
or similar
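The layout above could be produced by a small helper along these lines. It is a minimal sketch: the function name shardedPath and the variant labels are made up for illustration, and md5('coolparty.jpg') merely stands in for hashing the uploaded file.

```php
<?php
/**
 * Build the sharded path for one size variant of an image.
 * $hash is a lowercase hex string such as the output of md5_file();
 * $variant is a label such as 'thumb', 'large' or 'full'.
 * (Hypothetical helper, not part of any library.)
 */
function shardedPath(string $hash, string $variant, string $ext = 'jpg'): string
{
    if (!preg_match('/^[0-9a-f]{6,}$/', $hash)) {
        throw new InvalidArgumentException('Expected a lowercase hex hash');
    }
    // The first six characters become three directory levels: aa/bb/cc
    $dirs = implode('/', str_split(substr($hash, 0, 6), 2));
    return "/$dirs/{$hash}_{$variant}.$ext";
}

$hash = md5('coolparty.jpg'); // stand-in for md5_file($tmpName)
foreach (['thumb', 'large', 'full'] as $variant) {
    echo shardedPath($hash, $variant), "\n";
}
```

Because all variants share the same directory and base name, only the hash needs to be stored in the database; every path can be rebuilt from it.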
There are plenty of examples on the net as far as how to actually store images. Do you have a more specific question?

If you're talking millions of photos, I would suggest you farm these off to a third party such as Amazon Web Services, more specifically Amazon S3. There is no limit on the number of files and, assuming you don't need to actually list the files, there is no need to separate them into directories at all (and if you do need to list them, you can use different delimiters and prefixes - http://docs.amazonwebservices.com/AmazonS3/latest/dev/ListingKeysHierarchy.html). Your hosting/retrieval costs will probably be lower than doing it yourself, and the files get backed up.
To answer more specifically: yes, split by subdirectories. Using your structure, you can drop the first 6 characters of the filename, as you already have them in the directory names.
As for thumbs, as suggested by aquinas, just append _thumb1 etc. to the filename. Or store them in separate folders of their own.

1) That's something only you can answer. Generally, I prefer to store the images in the database so you can have ONE consistent backup, but YMMV.
2) How? How about /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb1.jpg, /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb2.jpg and /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb3.jpg
3) ??? Are you asking how to write a file to the file system or...?

Improved answer.
For millions of images, yes, it is correct that storing them in the database will slow things down.
The best option is either to store the images on the server file system and use .htaccess to add security,
or to use a web service. Many providers offer image APIs for uploading and displaying images.
You can go with that option too, for example Amazon.

Related

Multiple Small Directories Or One Huge Directory with file naming php mysql

This is a completely theoretical question.
I have a photo storage site in which photos are uploaded by users registered in the website.
The Question
Which of the approaches is faster?
And which is better in the long term, when I need to use a lot of computers and
hard disks?
Is there any other approach that's even better?
I have thought of two approaches for accomplishing this.
The number of files uploaded to my server is expected to be huge, more than 100 million.
Approach 1
These two /pictures/hd/ & /pictures/low/ directories will contain all the files uploaded by the user.
$newfilename = $user_id.time().$filename; //$filename = actual filename of uploaded file
$src = '/pictures/hd/'.$newfilename; //for hd pics
Inserting that into mysql by
insert into pics(`user_id`,`src`)VALUES('$user_id','$newfilename')
Approach 2
These two /pictures/hd/ & /pictures/low/ directories will contain sub-directories of the files uploaded by the user.
This is going to create lots of subdirectories with the name as user_id of the user who is uploading the file into the server.
if (!is_dir('/pictures/hd/'.$user_id.'/')) {
mkdir('/pictures/hd/'.$user_id.'/');
}
$newfilename = $user_id.'/'.$user_id.time().$filename; //$filename = actual filename of uploaded file
$src = '/pictures/hd/'.$newfilename; //for hd pics
Inserting that into mysql by
insert into pics(`user_id`,`src`)VALUES('$user_id','$newfilename')
Retrieval
When retrieving the image I can use the src column of my pics table to get the filename, then find the HD file at '/pictures/hd/'.$src_of_picstable and the low-quality file at '/pictures/low/'.$src_of_picstable
The right way to answer the question is to test it.
Which is faster will depend on the number of files and the underlying filesystem; ext3 and ext4 will quite happily cope with very large numbers of files in a single directory (dentries are managed in an HTree index). Some filesystems just use simple lists. Others have different ways of optimizing file access.
Your first scaling problem will be how to manage the file set across multiple disks. Just extending a single filesystem across lots of disks is a bad idea. If you have lots of directories, then you can have lots of mount points. But this doesn't work all that well when you get to terabytes of data.
However, because the content is indexed independently of the file storage, it doesn't matter much what you choose now for your file storage: you can easily change the mapping of files to locations later without having to move your existing dataset around.
I wouldn't suggest the single-directory approach, for two reasons. Firstly, if you're planning to have a lot of images, your directory will get really big, and searching for a single image manually will take a lot longer. You will need to do this when you debug something or test new features.
The second reason for multiple directories is that you can make smaller backups of parts of your gallery. And if you have a really big gallery (let's say several terabytes), a single hard drive might not be enough to contain it all. With multiple directories you can mount each directory on a separate hard drive and this way handle a gallery of almost unlimited size.
My favorite approach is a YYYY/MM/type-of-image directory structure. This way you can spot when you introduced a bug by looking month by month. You can also make monthly backups without duplicating redundant files, plus quarterly snapshots of the whole gallery just in case.
As for type-of-image, there are several types of images that I might need, such as the original image, small thumbnail, thumbnail, normal image, etc. This way I can just swap the type of image in the path and get a different image size.
For you I would suggest a YYYY/MM/type-of-image/user_id approach, where you could easily find all of a user's uploaded files in one place.
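A directory name in that YYYY/MM/type-of-image/user_id scheme could be built roughly as follows. This is a sketch only: the function name datedDir and the type labels are hypothetical, and the date is an arbitrary example.

```php
<?php
/**
 * Build a YYYY/MM/type-of-image/user_id directory for an upload.
 * The type labels ('original', 'thumb', ...) are hypothetical names.
 */
function datedDir(DateTimeInterface $uploadedAt, string $type, int $userId): string
{
    // In a date format string, 'Y' is the 4-digit year and 'm' the 2-digit month.
    return sprintf('%s/%s/%d', $uploadedAt->format('Y/m'), $type, $userId);
}

$when = new DateTimeImmutable('2013-04-17');
echo datedDir($when, 'original', 42), "\n"; // 2013/04/original/42
echo datedDir($when, 'thumb', 42), "\n";    // 2013/04/thumb/42
```

Swapping the type segment is all it takes to reach another size of the same image, which is the property described above.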

Organizing user images

I am developing a social system where users can upload images.
I am not sure how I should structure my files. The two ideas I have are the following:
Add a folder with the userID as name and put all the users images in there
Have a single folder for all images, and give each image a unique name
Which one of the two would be better? Is it the proper way to organize images?
It depends.
When you have a gallery with images stored in a directory per user, you also have to worry about duplicates (say a user uploads several files with the same name).
On the other hand, with a single folder you might run into performance issues related to the number of files in one directory (if you have more than 10k files, it might lag).
The solution I like is to create a unique name and cut it into several parts to create a directory structure.
For example:
Generate from a file image.jpg (look into the manual) a unique name, say
nfsr53a5gb
Add to it the original extension so it becomes nfsr53a5gb.jpg
Split it with / into nf/sr/53/a5/gb.jpg
Create missing folders as you go (see the recursive parameter of mkdir())
You won't hit the penalty for number of files in a directory soon and you won't get a collision and files have URLs difficult to guess.
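The steps above can be sketched in PHP as follows. The function names are made up for illustration, and random_bytes() is only one assumed way to get a unique name; the answer leaves the choice of generator open.

```php
<?php
/**
 * Turn a unique name into a nested path,
 * e.g. nfsr53a5gb + jpg => nf/sr/53/a5/gb.jpg
 */
function nestedPath(string $uniqueName, string $ext): string
{
    return implode('/', str_split($uniqueName, 2)) . '.' . $ext;
}

/** Store an uploaded file below $baseDir and return its relative path. */
function storeUpload(string $tmpFile, string $originalName, string $baseDir): string
{
    $ext = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    $uniqueName = bin2hex(random_bytes(5)); // 10 hex characters (assumed scheme)
    $relative = nestedPath($uniqueName, $ext);
    $target = rtrim($baseDir, '/') . '/' . $relative;

    $dir = dirname($target);
    // Third argument true = recursive: creates all missing parent folders.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create $dir");
    }
    if (!move_uploaded_file($tmpFile, $target)) {
        throw new RuntimeException('Upload could not be moved');
    }
    return $relative;
}

echo nestedPath('nfsr53a5gb', 'jpg'), "\n"; // nf/sr/53/a5/gb.jpg
```

In an upload handler this would be called as storeUpload($_FILES['image']['tmp_name'], $_FILES['image']['name'], '/path/to/images'), with the returned relative path saved in the database next to the original name.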
A nice touch is to add a controller for getting those files which will change the name to the original name (store it in database and switch headers). Use it only for a dedicated download button as this might be too CPU and I/O intensive for many images embedded on a page.
In order to do this you have to modify headers like this:
Content-Disposition: attachment; filename=YOUR_FILE_NAME.YOUR_EXTENSION
Content-Type: application/octet-stream
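A download controller that sends these headers might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the original name is assumed to come from the database, and the Content-Length header is an addition not mentioned in the answer.

```php
<?php
/**
 * Build the headers that make the browser download a file under its
 * original name. $originalName would come from the database.
 */
function downloadHeaders(string $originalName, int $size): array
{
    // Strip characters that could break out of the quoted header value.
    $safeName = str_replace(['"', "\r", "\n", '\\', '/'], '', $originalName);
    return [
        'Content-Type: application/octet-stream',
        'Content-Disposition: attachment; filename="' . $safeName . '"',
        'Content-Length: ' . $size, // assumed extra, lets the browser show progress
    ];
}

/** Send a stored file to the client as a download. */
function sendDownload(string $storedPath, string $originalName): void
{
    foreach (downloadHeaders($originalName, filesize($storedPath)) as $line) {
        header($line);
    }
    readfile($storedPath);
}

print_r(downloadHeaders('coolparty.jpg', 12345));
```

Every request served this way goes through PHP, which is why it suits a dedicated download button rather than images embedded on a page.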
Go for option #2; otherwise, as the number of users increases, the number of folders will keep increasing.
Fetching images in your code will also be simpler with this method.

What is the most efficient way to store 500.000 images?

I'm coding a basic gallery for a website with around 40.000 online people at any given time. Users will be able to create galleries and upload images.
My question is, should I make a separate folder for each gallery and put the images in them, or make a single folder and put all images in it, but keep the gallery_id for each image in the database? Or, should I make a directory for every user, then another directory inside them for the gallery names?
How would you do this?
Ps. I need it to be as light as it can.
I would store them by id,
and I would split them into folders (depending on the filesystem, some don't perform well with lots of files in one folder); plus, it is easier to find them if you have to look at something manually.
Give each file an id, then, using the first 3 digits of the file name, split them into folders. (You could start your auto-increment counter at 100000 or zero-pad the id, so there are always at least 3 levels.)
/photos/1/0/3/103456.jpg
/photos/9/4/1/941000.jpg
/photos/0/0/0/000001.jpg
You can store the relationship of photo to user / gallery / etc in the database
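The id-to-path mapping above can be sketched as a small function. The name photoPath, the default base directory, and the fixed .jpg extension are assumptions for illustration.

```php
<?php
/**
 * Map a numeric photo id to a path such as /photos/1/0/3/103456.jpg.
 * Ids are zero-padded to six digits so there are always three levels.
 */
function photoPath(int $id, string $base = '/photos'): string
{
    $name = str_pad((string) $id, 6, '0', STR_PAD_LEFT);
    return sprintf('%s/%s/%s/%s/%s.jpg', $base, $name[0], $name[1], $name[2], $name);
}

echo photoPath(103456), "\n"; // /photos/1/0/3/103456.jpg
echo photoPath(941000), "\n"; // /photos/9/4/1/941000.jpg
echo photoPath(1), "\n";      // /photos/0/0/0/000001.jpg
```

Note that with six-digit ids each leaf folder holds at most 1,000 files; once ids grow past six digits, the same folders start receiving more, so the padding width or the number of levels may need to grow too.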
Or if you want to see how the big boys do it
Needle in a haystack: efficient storage of billions of photos
Typically web servers don't want you to have more than a few thousand images in a single folder (I recently had to deal with 70,000 images causing super slow reads and sorts, so trust me on this), so certainly not a single folder if you think you will have thousands of images. I would suggest the best solution would be to host on Amazon S3 connected to their CDN, CloudFront, but if that isn't realistic you can still do several things on your own server.
Make a separate folder for each gallery like you suggest only if you know some bounds on how large a gallery can get and have an idea of how many galleries will be created. (This is what I would suggest for your specific problem right now)
Put the image name through a hash function then use the first 1-3 characters of the hash to name folders to put the images into. The hash ensures that the images are roughly equally split among the folders and you can decide how many folders you need.
At any rate having the information of what gallery and the image id in the actual path will probably be useful to you moving forward both in code and whenever a human has to hunt bugs on the server. I would probably name the folders based on the gallery id and just make sure that no gallery has more than a few thousand images in it.
I store mine like this:
images/userid/photoid
This way I can quickly isolate user images if I need to inspect anything at a later date. It seems more organized than dropping them all in one central directory.

What is best practice when it comes to storing images for a gallery?

My question is not about storing images on disk or in DB.
Images will be stored on disk
Image path and other image data will be saved in database.
Images will be given a unique filename
Images will be stored in 3 sizes
In time there may be many images used by many users
My questions are:
- Should images be stored in one folder, or many folders?
- Is it ok to use md5 for creating unique id's? E.g. md5(id+filename+random_num)
- Should images be cached on server or on clients browser / computer?
Anything else I should think of?
The solution is using php, apache and mysql. We use Uploadify for uploading images.
Some code I use today
/**
 * Calculate the directory tree for an object.
 * Folder names run from 00 to FF (hex), and each folder can have
 * just as many subfolders (I think :)
 * @param $id - User ID
 * @param $type - Image category
 * @return string
 */
function calculateDirTree($id, $type)
{
    $hashUserID = substr(hash('md5', $id), -4);
    $parentFolder = substr($hashUserID, 0, 2);
    $subfolder = substr($hashUserID, 2);
    $basePath = $type.'/'.$parentFolder.'/'.$subfolder.'/';
    return $basePath;
}
Should images be stored in one folder, or many folders?
You are talking about "100k - 200k images", so many folders are a must-have. Try to have at most ~1000 images in one folder.
Is it ok to use md5 for creating unique id's? E.g. md5(id+filename+random_num)
Yes, you can do this. It will avoid problems with long filenames.
Should images be cached on server or on clients browser / computer?
They should be cached on the client side. The problem with so many images is that they create a lot of traffic. Caching on the client helps reduce this.
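If images are served through a PHP script rather than directly by Apache, client-side caching can be requested with response headers. The following is a minimal sketch under that assumption; the function name cacheHeaders and the lifetimes are illustrative, and long lifetimes are only safe because each image has a unique filename that never changes.

```php
<?php
/**
 * Headers asking browsers to cache an image for $days days.
 * (Hypothetical helper; pass each line to header() before sending the image.)
 */
function cacheHeaders(int $days = 365): array
{
    $seconds = $days * 86400;
    return [
        'Cache-Control: public, max-age=' . $seconds . ', immutable',
        'Expires: ' . gmdate('D, d M Y H:i:s', time() + $seconds) . ' GMT',
    ];
}

foreach (cacheHeaders(30) as $line) {
    echo $line, "\n";
}
```

When Apache serves the files directly, the same effect is usually configured in the server rather than in PHP.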
Depending on the number of images you want to handle, I would strongly suggest using several folders. The easiest way should be to use the first letter of the file name to create a folder structure. I think, the numbers are something like this:
less than 1000 images --> one folder
less than 20000 images --> one level of folders (a, b, c, ...)
more --> several levels (a containing aa, ab, b containing ba, bb, ...)
YMMV
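Those rules of thumb could be turned into code along these lines. It is a sketch only: the function names are hypothetical, and capping "several levels" at two is an assumption, since the answer does not say how deep to go.

```php
<?php
/**
 * Pick a folder depth from the expected image count, using the rough
 * thresholds quoted above (rules of thumb, not hard limits).
 */
function folderDepth(int $expectedImages): int
{
    if ($expectedImages < 1000) {
        return 0; // one folder
    }
    if ($expectedImages < 20000) {
        return 1; // a/, b/, c/, ...
    }
    return 2;     // assumed cap: a/aa/, a/ab/, ...
}

/** Level n is named after the first n characters of the file name. */
function letterPath(string $fileName, int $depth): string
{
    $parts = [];
    for ($i = 1; $i <= $depth; $i++) {
        $parts[] = strtolower(substr($fileName, 0, $i));
    }
    $parts[] = $fileName;
    return implode('/', $parts);
}

echo letterPath('abc.jpg', folderDepth(500)), "\n";    // abc.jpg
echo letterPath('abc.jpg', folderDepth(5000)), "\n";   // a/abc.jpg
echo letterPath('abc.jpg', folderDepth(500000)), "\n"; // a/ab/abc.jpg
```

This spreads files evenly only if the first characters of the names are evenly distributed, which holds for hashed names but usually not for user-supplied ones.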
I think using multiple folders or same folder is dependent on your web application. For instance, if there are multiple profiles with each profile having multiple images, you can use multiple folders with using folder names as profile names.
My last piece of advice: if you have tons of images, the SHA-256 hash algorithm is better than MD5 at preventing collisions.
Regarding the caching, it would be best to cache it at both ends, that way new images are retrieved quickly, and users visiting existing images have it cached.
I don't know of any filesystem limits regarding storing them in one or multiple folders.
Definitely go for the file system: it performs better and is a better fit for storing files (that's what it is made for). SQL can slow down when saving or retrieving large images.
You can create a folder for each user (using the ID as the folder name), and when an image is saved on the file system you can save a reference to it in a UserImages table (by saving the filename against the user in SQL). You can ensure that each image gets a unique filename by renaming it when you save it; you can use a combination of the original filename and the current date and time (no need to use MD5).
Also, images should always be cached in order to save your bandwidth and the clients'.

Upload sets of images to one single folder or separate folders for each set?

I will have 150-200 products on my website, which could grow in the future, and I have around 30-40 images for each product. Should I have separate folders for storing the images of each product, or save all the images in one single folder?
Thanks
I'd use a separate folder for each product - it seems much neater that way and won't end up with too many files in the directory.
It would also make it easier to iterate over all a product's images.
If each product has a unique ID number then you should probably use that for the folder name, unless you want to go the full SEO route and have something like /images/my-product-name/...jpg
That depends on the file system - with FAT32 a directory can contain up to 65,536 entries (and a file with a long name uses more than one entry), so you'll probably be fine.
It also depends on whether you only ever want programmatic access, or whether you want some person to have to ever look at a directory with 65K files.
