I store each user's files in a directory named after the user, something like:
/username/file01.jpg
/username/file02.mp4
/username/file03.mp3
But as more users sign up and upload more files, this becomes a problem, because some or many users will have to be migrated to another drive. I chose the per-username directory first because I don't want filenames to get mixed up, and I don't want to change the filenames either. Also, if files are stored under their original names, another user uploading a file with the same name causes a conflict.
What would be the best way to do this? I have one solution in mind, but I want to ask the community whether it is the best way.
I will use sequential folders, hash each filename to something unique, and store the file under that hash in the directory.
I will store the original filename, the username, and the hash value (the name under which the file is stored on disk) in the database.
When anyone wants to access the file, I will read it through PHP and either replace the name or do something else at that point, so that the file is downloaded with its original filename.
This is the only solution I have in mind. Do you have a better one?
Edit:
I use a folder system too, and for the second approach I will possibly use virtual folders.
My database is MongoDB.
All your answers were awesome and really helpful. I wanted to give the bounty to everyone, which is why I left it so that the community could award it automatically.
Thanks, all, for your answers. I really appreciate it.
Could you create relational MySQL tables? e.g.:
A users table and a files table.
Your users table would keep track of everything you are (I assume) already tracking:
id, name, email, etc.
Then the files table would store something like:
id, fileExtension, fileSize, userID <---- userID would be the foreign key pointing to the id field in the users table.
Then, when you save a file, you could save it as its id.fileExtension and use a query to pull the user associated with that file, or all files associated with a user.
e.g.:
SELECT users.name, files.id, files.fileExtension
FROM `users`
INNER JOIN `files` on users.id = files.userID;
I handle file metadata in the database and retrieve the files with a UUID. What I do is:
Content based identification
MD5 from file's content
Namespaced UUID:v5 to generate unique identifier based on user's uuid and file's md5.
Custom function to generate path based on 'realname'.
Save on the database: uuid, originalname (the uploaded name), realname (the generated name), filesize, and mime. (optional dateAdded, and md5)
File retrieval.
UUID to retrieve metadata.
Regenerate the filepath based on realname.
Originalname is used to show a familiar name to the user that downloads the file.
I process the file's name by assigning it a namespaced UUID as the database primary key, and generate the path based on the user and the filename. The precondition is that your user has a UUID assigned. The following code helps you avoid ID collisions in the database and lets you identify files by their contents (in case you ever need a way to spot duplicate content and not necessarily duplicate filenames).
$fileInfo = pathinfo($_FILES['file']['name']);
$extension = (isset($fileInfo['extension']))?".".$fileInfo['extension']:"";
$md5Name = md5_file($_FILES['file']['tmp_name']); //you could use other hash algorithms if you are so inclined.
$realName = UUID::v5($user->uuid, $md5Name) . $extension; //UUID::v5(namespace, value).
I use a function to generate the filepath based on some custom parameters; you could use $username and $realname. This is helpful if you implement a distributed folder structure, which you might have partitioned by file naming scheme or by any custom scheme.
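function generateBasePath($realname, array $customArgs = array()) {
    // Process the args according to your requirements.
    // The generated path might as well be the first three characters of $realname,
    // or a checksum that helps you decide which drive/volume/mountpoint to use,
    // e.g. some files on the local disk and others on an Amazon S3 mountpoint.
    // 'mountpoint' is an illustrative key; the default below is only a placeholder.
    $mountpoint = isset($customArgs['mountpoint']) ? $customArgs['mountpoint'] : '/mnt/storage';
    $generatedPath = substr($realname, 0, 3);
    return $mountpoint.'/'.$generatedPath;
}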
As an added bonus this also:
helps you maintain a versioned file repository, if you add an attribute to the file's record stating which file (uuid) it has replaced.
lets you create an application-level Access Control List, if you add 'owner' and/or 'group' attributes.
also works with a single-folder structure.
Note: I used PHP's $_FILES as an example of the file source, based on this question's tags. The file can come from any source, or be generated content.
Since you already use MongoDB, I would suggest checking out GridFS. It's a specification that allows you to store files (even ones larger than 16 MB) in MongoDB collections.
It is scalable, so you'll have no problems if you add another server. It also stores metadata, lets you read files in chunks, and has built-in backup functions.
Another tactic is to create a 2-dimensional structure where the first level of directories are the first 2 characters of the username, then the second level is the remaining characters (similar to how Git stores its SHA-1 object IDs). For example:
/files/jr/andomuser/456.jpg
for user 'jrandomuser'.
Please note that as usernames will likely not be distributed as randomly as SHA-1 values, you may need to add another level later on. Doubt it, though.
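The two-level layout above can be sketched in a few lines of PHP. This is only an illustration: `userDirectory()` and the `/files` base are hypothetical names, usernames are assumed to be plain ASCII without slashes, and the `_` marker for very short usernames is an arbitrary choice.

```php
<?php
// Hypothetical helper: maps a username to "<base>/<first two chars>/<rest>".
function userDirectory(string $username, string $base = '/files'): string
{
    if (strlen($username) < 3) {
        // Too short to split; use a fixed second-level marker (assumption).
        return $base . '/' . $username . '/_';
    }
    return $base . '/' . substr($username, 0, 2) . '/' . substr($username, 2);
}

// Example: userDirectory('jrandomuser') . '/456.jpg'
// gives "/files/jr/andomuser/456.jpg".
```

If a third level is ever needed, the same `substr()` split can be applied once more to the remainder.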
I would generate a GUID for the filename based on a hash of the original filename, the date and time of the upload, and the username, and save those values, as well as the path to the file, in a database for later use. If you generate such a GUID, the filenames cannot be guessed.
As an example, let's say user Daniel Steiner (me) uploads a file called resume.doc to your server on 23 April 2013 at 00:37. This would give a base value of
Daniel_Steiner+2013/23/04+00:37+resume.doc, which as an MD5 hash would be 05c2d2f501e738b930885d991d136f1e. To ensure that the file will be opened in the right program, we then add the right file extension and get something like http://link.to/your/site/05c2d2f501e738b930885d991d136f1e.doc. If your user accounts already have a user ID, you could add it to the URL. For example, if my user ID were 123145, the URL would be http://link.to/your/site/123145/05c2d2f501e738b930885d991d136f1e.doc
If you save the original filename in the database, you can later also offer a download script that serves the file under its original filename, even though it has another filename on your server.
If you can use symbolic links, relocating the files to another hard disk shouldn't be a problem either.
If you want, I could come up with a PHP example as well. It shouldn't be too much code.
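A minimal sketch of that naming scheme might look like the following. The function name `buildStoredName()` and the exact layout of the base string are assumptions for illustration, so the hash it produces will not necessarily match the example value above.

```php
<?php
// Hypothetical helper: derives the stored filename from the username,
// the upload time (Unix timestamp) and the original filename.
function buildStoredName(string $username, int $uploadTime, string $originalName): string
{
    // gmdate() keeps the result independent of the server's timezone setting.
    $base = $username . '+' . gmdate('Y-m-d+H:i:s', $uploadTime) . '+' . $originalName;
    $hash = md5($base);

    // Keep the original extension so the file opens in the right program.
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    return $extension === '' ? $hash : $hash . '.' . $extension;
}

// Store $originalName, $username, $uploadTime and the returned name in the
// database so the download script can restore the original filename later.
```

The same inputs always give the same name, so the database row, not the hash, is what ties a stored file back to its owner.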
Since a filesystem is a tree, not a graph (faceted classification), it's hard to come up with a way for it to easily represent multiple entities such as users, media types, dates, events, image crop types, etc. That's why using a relational database is easier: it can be converted to a graph.
But since it's another level of abstraction, you need to write the functions that do the low-level synchronization yourself, including avoiding name collisions, long path names, and large file counts per folder, and supporting ease of transfer per entity, horizontal scaling, etc. So it depends on how complex your application needs to be.
I suggest using the following database structure:
Where the File table has at least:
IDFile is an auto_increment column / primary key.
UserID is a nullable foreign key.
For FK_File_User I suggest:
ON UPDATE NO ACTION -- IDUser is auto_increment too. No changes need to be tracked.
ON DELETE SET NULL -- If the user is deleted, the File becomes unowned. It might be
-- deleted later by a CRON job or something else.
Other columns might also be added to the File table:
Actual upload date and time
Actual mime-type
Actual storage place (for distributed storage systems)
Download count (another table might be a better solution)
etc...
Some benefits:
You don't need to calculate the file size, hash, extension or any other file metadata, because you can obtain it with one database operation.
You can obtain per-user statistics (file count, space used, or whatever else you wrote to the File table) with a single SELECT ... GROUP BY ... WITH ROLLUP statement, and it will be faster than analyzing the actual files, which may be spread across multiple storage devices.
You can apply file access permissions for different users. This requires no significant change to the table structure.
I don't consider keeping the original filenames in storage an option, for two reasons:
A file may have a name that is not correctly supported by the server OS filesystem, such as a Cyrillic one.
Two different files may have completely identical names, so one of them might be overwritten by the other.
So, there is a solution:
1) When files are uploaded, rename them to the IDFile returned by the INSERT into the File table. It's safe and there are no duplicates.
2) Restore the name of the file when it's needed / downloaded, like:
// perform query to "File" table by given ID
list($name, $ext, $size, $md5) = $result->fetch_row();
$result->free();
header('Content-Length: ' . $size);
header('Content-MD5: ' . $md5);
header('Accept-Ranges: bytes');
header('Connection: close');
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $name . '.' . $ext . '"');
// flush file content
3) The actual files may be stored in a single directory (because IDFile is safe) or in an IDUser-named subdirectory, depending on the situation.
4) As IDFile is a direct sequence, if some files go missing, you can find their database metadata by looking for gaps in the sequence of actual filenames. Then you can inform the owners, delete the file metadata, or both.
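The gap check from point 4 could be sketched like this. `findMissingFileIds()` is a hypothetical helper, and it assumes that every ID from 1 up to the highest IDFile is still expected on disk; rows that were deliberately deleted would have to be excluded separately.

```php
<?php
// Hypothetical helper: given the numeric IDs found on disk and the highest
// IDFile known to the database, return the IDs whose files are missing.
function findMissingFileIds(array $idsOnDisk, int $maxId): array
{
    $present = array_flip($idsOnDisk); // ID => position, for O(1) lookups
    $missing = [];
    for ($id = 1; $id <= $maxId; $id++) {
        if (!isset($present[$id])) {
            $missing[] = $id;
        }
    }
    return $missing;
}

// Example: findMissingFileIds([1, 2, 4, 7], 7) returns [3, 5, 6].
```

The returned IDs can then be used in a query against the File table to notify owners or to remove the orphaned metadata.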
I'm against the idea of storing large files in the DBMS itself as binary content.
A DBMS is about data and analysis. It's not a filesystem, and it should never be used that way, if my humble opinion matters.
You can install an LDAP server. LDAP lookups are very fast, since LDAP is highly optimized for read-heavy workloads. You can even query the data.
LDAP organizes data in a tree-like fashion.
You can organize the data like this: "user->IP address->folder->file name". This way, files can be physically or geographically spread out and you can still fetch a file's location very quickly.
You can also run standard LDAP queries, e.g. to list all files for a particular user, or all files in a folder.
1. Use MongoDB to store the actual filename (e.g. myImage.jpg) and other attributes (e.g. the MIME type), plus $random-text.jpg from 2. and 3. below.
2. Generate some $random-text, e.g. base_convert(mt_rand(), 10, 36) or uniqid($username, true);
3. Physically store the file as $random-text.jpg. It is always good to keep the same extension.
NOTE: Use filter_var() to ensure the input filename doesn't pose a security risk to MongoDB.
Amazon S3 is reliable and cheap, but be aware of "eventual consistency" with S3.
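The random-name steps above could be sketched as follows. Note that this sketch swaps in `random_bytes()` (PHP 7+) instead of the `mt_rand()` / `uniqid()` calls mentioned above, because it gives less predictable names; `randomStoredName()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns "<32 random hex chars>.<original extension>".
function randomStoredName(string $originalName): string
{
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    // Only keep simple alphanumeric extensions; drop anything suspicious.
    if (!preg_match('/^[a-z0-9]{1,10}$/', $extension)) {
        $extension = '';
    }

    $random = bin2hex(random_bytes(16));
    return $extension === '' ? $random : $random . '.' . $extension;
}

// The original name (e.g. "myImage.jpg"), the MIME type and the value
// returned here would then be saved together in one MongoDB document.
```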
Assuming users have a unique ID (primary key) in the database, if a user with ID 73 uploads a file, save it like this:
"uploads/{$userid}_{$filename}.{$ext}"
For example, 73_resume.doc, 73_myphoto.jpg
Now, when fetching files, use this code (the braces are needed, because PHP would otherwise read $userid_ as the variable name):
foreach (glob("uploads/{$userid}_*.*") as $filename) {
echo $filename;
}
This can be combined with hashing solutions (stored in the DB), so that a user who gets a download path as 73_photo.jpg does not randomly try 74_photo.jpg in the browser address bar.
I am developing a social system where users can upload images.
I am not sure how I should structure my files. The two ideas I have are the following:
Add a folder with the userID as name and put all the users images in there
Have a single folder for all images, and give each image a unique name
Which one of the two would be better? Is it the proper way to organize images?
It depends.
When you have a gallery with images stored in a directory per user, you still have to worry about duplicates (say, a user uploads several files with the same name).
On the other hand, with a single folder for all images you might run into performance issues related to the number of files in one directory (with more than 10k files it might lag).
The solution I like is to create a unique name and split it into several parts to create a directory structure.
For example:
Generate from a file image.jpg (look into the manual) a unique name, say
nfsr53a5gb
Add to it the original extension so it becomes nfsr53a5gb.jpg
Split it with / into nf/sr/53/a5/gb.jpg
Create missing folders as you go (see the recursive parameter of mkdir())
You won't hit the penalty for number of files in a directory soon and you won't get a collision and files have URLs difficult to guess.
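The splitting steps above can be sketched as follows. Both function names are hypothetical, and the unique name is assumed to contain only characters that are safe in paths (for example lowercase letters and digits).

```php
<?php
// Hypothetical helper: "nfsr53a5gb" + "jpg" => "nf/sr/53/a5/gb.jpg".
function shardedPath(string $uniqueName, string $extension, int $chunk = 2): string
{
    return implode('/', str_split($uniqueName, $chunk)) . '.' . $extension;
}

// Hypothetical helper: creates the missing folders under $root and returns
// the full target path for the file.
function ensureDirectory(string $root, string $relativePath): string
{
    $dir = $root . '/' . dirname($relativePath);
    // The third argument (recursive = true) creates all missing parents.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Could not create directory: $dir");
    }
    return $root . '/' . $relativePath;
}

// Example: shardedPath('nfsr53a5gb', 'jpg') returns "nf/sr/53/a5/gb.jpg".
```

The value returned by `ensureDirectory()` is what you would pass as the destination to `move_uploaded_file()`.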
A nice touch is to add a controller for getting those files which will change the name to the original name (store it in database and switch headers). Use it only for a dedicated download button as this might be too CPU and I/O intensive for many images embedded on a page.
In order to do this you have to modify headers like this:
Content-Disposition: attachment; filename=YOUR_FILE_NAME.YOUR_EXTENSION
Content-Type: application/octet-stream
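As a rough sketch, a download controller could build those headers like this. `downloadHeaders()` is a hypothetical helper, and the original name is assumed to come from your database.

```php
<?php
// Hypothetical helper: returns the header lines needed to offer a stored
// file for download under its original name.
function downloadHeaders(string $originalName, int $size): array
{
    // Replace characters that could break the header or inject new ones.
    $safeName = str_replace(['"', "\r", "\n", '\\', '/'], '_', $originalName);

    return [
        'Content-Type: application/octet-stream',
        'Content-Disposition: attachment; filename="' . $safeName . '"',
        'Content-Length: ' . $size,
    ];
}

// In the controller (not executed here):
//   foreach (downloadHeaders($row['original_name'], filesize($path)) as $line) {
//       header($line);
//   }
//   readfile($path);
```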
Go for option #2. Otherwise, as the number of users increases, the number of folders will keep increasing.
This method also makes fetching images in your code simpler.
I am building a site that is looking at Millions of photos being uploaded easily (with 3 thumbnails each for each image uploaded) and I need to find the best method for storing all these images.
I've searched and found examples of images stored as hashes. For example:
If I upload coolparty.jpg, my script would convert it to an MD5 hash, resulting in
dcehwd8y4fcf42wduasdha.jpg
and that's stored in /dc/eh/wd/dcehwd8y4fcf42wduasdha.jpg
but for the 3 thumbnails I don't know how to store them
QUESTIONS..
Is this the correct way to store these images?
How would I store thumbnails?
In PHP what is example code for storing these images using the method above?
How I use the folder structure:
I upload the photo and move it like you said:
// IMAGES_PATH is a constant stored in my global config
define('IMAGES_PATH', '/path/to/my/images/');
$image = md5_file($_FILES['image']['tmp_name']);
// you can add a random number to the file name just to make sure your images will be "unique"
$image = md5(mt_rand().$image);
$folder = $image[0]."/".$image[1]."/".$image[2]."/";
// e.g. coolparty.jpg => f3d40fc20a86e4bf8ab717a6166a02d4
$imagePath = IMAGES_PATH.$folder.$image.'.jpg';
// thumbnail: I just prepend t_ to the image name
$thumbPath = IMAGES_PATH.$folder.'t_'.$image.'.jpg';
// make sure you create the folders with mkdir() before you move the files
// then move_uploaded_file() to $imagePath, and save the thumbnail to $thumbPath after processing
I believe this is the basic way. Of course, you can change the folder structure to a deeper one with 2 characters per level, like you said, if you will have millions of images.
The reason you would use a method like that is simply to reduce the total number of files per directory (inodes).
Using the method you have described (3 levels deep), you are very unlikely to reach even hundreds of images per directory, since you will have a maximum of almost 17 million directories (16**6 = 16,777,216).
As for your questions:
Yeah, that is a fine way to store them.
The way I would do it would be
/aa/bb/cc/aabbccdddddddddddddd_thumb.jpg
/aa/bb/cc/aabbccdddddddddddddd_large.jpg
/aa/bb/cc/aabbccdddddddddddddd_full.jpg
or similar
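A small sketch of that layout, assuming the hash is at least six characters long and that the three variants are called thumb, large and full as above (`variantPaths()` is a hypothetical helper):

```php
<?php
// Hypothetical helper: returns the storage path of each size variant,
// sharded by the first three pairs of characters of the hash.
function variantPaths(string $hash, string $extension = 'jpg'): array
{
    $dir = '/' . substr($hash, 0, 2)
         . '/' . substr($hash, 2, 2)
         . '/' . substr($hash, 4, 2);

    $paths = [];
    foreach (['thumb', 'large', 'full'] as $variant) {
        $paths[$variant] = $dir . '/' . $hash . '_' . $variant . '.' . $extension;
    }
    return $paths;
}
```

Keeping all variants next to each other means a single directory lookup finds every size of a given image.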
There are plenty of examples on the net as far as how to actually store images. Do you have a more specific question?
If you're talking millions of photos, I would suggest you farm these off to a third party such as Amazon Web Services, more specifically Amazon S3. There is no limit on the number of files and, assuming you don't need to actually list the files, there is no need to separate them into directories at all (and if you do need to list them, you can use different delimiters and prefixes - http://docs.amazonwebservices.com/AmazonS3/latest/dev/ListingKeysHierarchy.html). Your hosting/retrieval costs will probably be lower than doing it yourself, and the files get backed up.
To answer more specifically: yes, split by subdirectories. Using your structure, you can drop the first 6 characters of the filename, as you already have them in the directory names.
As for thumbs, as suggested by aquinas, just append _thumb1 etc. to the filename, or store them in separate folders.
1) That's something only you can answer. Generally, I prefer to store the images in the database so you can have ONE consistent backup, but YMMV.
2) How? How about /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb1.jpg, /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb2.jpg and /dc/eh/wd/dcehwd8y4fcf42wduasdha_thumb3.jpg
3) ??? Are you asking how to write a file to the file system or...?
For millions of images, yes, it is correct that using a database will slow the process down.
The best option is either to use the server file system to store the images and use .htaccess to add security,
or to use a web service. Many providers offer image APIs for uploading and displaying images.
You can go with that option too, for example Amazon.
My question is not about storing images on disk or in DB.
Images will be stored on disk
Image path and other image data will be saved in database.
Images will be given a unique filename
Images will be stored in 3 sizes
In time there may be many images used by many users
My questions are:
- Should images be stored in one folder, or many folders?
- Is it ok to use md5 for creating unique id's? E.g. md5(id+filename+random_num)
- Should images be cached on server or on clients browser / computer?
Anything else I should think of?
The solution is using php, apache and mysql. We use Uploadify for uploading images.
Some code I use today
/**
 * Calculate dir tree for object
 * Folders start from 00 to FF (HEX) and can have just as
 * many subfolders (I think :)
 * @param $id - User ID
 * @param $type - Image category
 * @return string
 */
function calculateDirTree($id, $type)
{
    $hashUserID = substr(hash('md5', $id), -4);
    $parentFolder = substr($hashUserID, 0, 2);
    $subfolder = substr($hashUserID, 2);
    $basePath = $type."/".$parentFolder.'/'.$subfolder.'/';
    return $basePath;
}
Should images be stored in one folder, or many folders?
You are talking about "100k - 200k images", so many folders is a must-have. Try to keep a maximum of ~1000 images in one folder.
Is it ok to use md5 for creating unique id's? E.g. md5(id+filename+random_num)
Yes, you can do this. It will avoid problems with long filenames.
Should images be cached on server or on clients browser / computer?
They should be cached on the client side. The problem with so many images is that they create a lot of traffic. Caching on the client helps reduce this.
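If the images are served through PHP, client-side caching could be requested with headers along these lines. This is only a sketch: `clientCacheHeaders()` is a hypothetical helper and the 30-day lifetime is an arbitrary assumption. For files served directly by Apache, the equivalent would be configured in the server instead.

```php
<?php
// Hypothetical helper: header lines that let browsers cache an image.
function clientCacheHeaders(string $filePath, int $maxAgeSeconds = 2592000): array
{
    $mtime = filemtime($filePath);

    return [
        // Allow browsers and proxies to reuse the image for $maxAgeSeconds.
        'Cache-Control: public, max-age=' . $maxAgeSeconds,
        // Lets the client revalidate with If-Modified-Since later on.
        'Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT',
        // Changes whenever the file is replaced.
        'ETag: "' . md5($filePath . $mtime) . '"',
    ];
}
```

Because the stored filenames are unique hashes, a replaced image gets a new URL anyway, which is what makes a long cache lifetime safe.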
Depending on the number of images you want to handle, I would strongly suggest using several folders. The easiest way should be to use the first letter of the file name to create a folder structure. I think, the numbers are something like this:
less than 1000 images --> one folder
less than 20000 images --> one level of folders (a, b, c, ...)
more --> several levels (a containing aa, ab, b containing ba, bb, ...)
YMMV
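The rule of thumb above could be expressed like this. Both helpers are hypothetical, and "several levels" is simplified to two for the sketch.

```php
<?php
// Hypothetical helper: how many folder levels to use for a given image count.
function folderLevelsFor(int $imageCount): int
{
    if ($imageCount < 1000) {
        return 0; // everything in one folder
    }
    if ($imageCount < 20000) {
        return 1; // a/, b/, c/, ...
    }
    return 2;     // a/aa/, a/ab/, b/ba/, ... (assumption: two levels suffice)
}

// Hypothetical helper: "abc.jpg" with 2 levels => "a/ab/abc.jpg".
function letterPath(string $fileName, int $levels): string
{
    $name = strtolower($fileName);
    $path = '';
    for ($i = 1; $i <= $levels; $i++) {
        $path .= substr($name, 0, $i) . '/';
    }
    return $path . $fileName;
}
```

This only spreads files evenly if the first letters are evenly distributed, which is the case for hashed names but usually not for user-chosen ones.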
I think using multiple folders or same folder is dependent on your web application. For instance, if there are multiple profiles with each profile having multiple images, you can use multiple folders with using folder names as profile names.
My last piece of advice: if you have tons of images, the SHA-256 hashing algorithm is better at preventing collisions.
Regarding the caching, it would be best to cache it at both ends, that way new images are retrieved quickly, and users visiting existing images have it cached.
I don't know of any filesystem limits regarding storing them in one or multiple folders.
Definitely go for the file system: it performs better and is a better fit for storing files (that's what it is made for). SQL can slow down when saving/retrieving large images.
You can create a folder for each user (using the ID as the folder name), and when an image is saved on the file system you can save the reference in a UserImages table (by saving the filename against the user in SQL). You can ensure that each image gets a unique filename by renaming it when you save it. You can use a combination of the original filename and the current date and time (no need to use MD5).
Also, images should always be cached in order to save your bandwidth and the clients'.
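The per-user folder plus "original name and date/time" naming suggested above might be sketched as follows. `timestampedName()` is a hypothetical helper; note that two uploads of the same filename by the same user within the same second would still collide.

```php
<?php
// Hypothetical helper: "<userId>/<cleaned original name>_<timestamp>.<ext>".
function timestampedName(int $userId, string $originalName, DateTimeInterface $savedAt): string
{
    $info = pathinfo($originalName);

    // Replace anything that is not a letter, digit, underscore or hyphen.
    $stem = preg_replace('/[^A-Za-z0-9_-]+/', '_', $info['filename']);
    $ext  = isset($info['extension']) ? '.' . strtolower($info['extension']) : '';

    return $userId . '/' . $stem . '_' . $savedAt->format('Ymd_His') . $ext;
}

// The returned relative path is what would be stored in the UserImages table.
```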
What are some ideas out there for storing images on web servers? I'm using PHP and MySQL for the application.
Question 1
Do we change the name of the physical file to a000000001.jpg and store it in a base directory or keep the user's unmanaged file name, i.e 'Justin Beiber Found dead.jpg'? For example
wwroot/imgdir/a0000001.jpg
and all meta data in a database, such as FileName and ReadableName, Size, Location, etc.
I need to make a custom file manager, and I'm just weighing the pros and cons of the underlying structure for storing the images.
Question 2
How would I protect an image from being downloaded if my app/database has not set it to be published/public?
In my app I can publish images or protect them from download. If I stored the image in a DB table, I could store it as a BLOB and use PHP to prevent the user from downloading it. I want to be able to do the same when the image is stored in the file system, but I'm not sure whether this is possible with PHP and files on disk.
Keeping relevant file names can be good for SEO, but you must also make sure you don't duplicate.
In all cases I would rename files to lowercase and replace spaces with underscores (or hyphens):
Justin Beiber Found dead.jpg => justin_beiber_found_dead.jpg
If the photos belong to an article or something specific, you can perhaps add the article ID to the image name, e.g. 123_justin_beiber_found_dead.jpg. Alternatively, you can store the images in an article-specific folder, e.g. /images/123/justin_beiber_found_dead.jpg.
Naming the files like a0000001 removes all relevance to the files and adds no value whatsoever.
Store (full) filepaths only in the database.
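The renaming rule described above (lowercase, spaces to underscores, optional article ID prefix) could be sketched like this. `seoFileName()` is a hypothetical helper that assumes ASCII filenames, and you would still need to check for duplicates before saving.

```php
<?php
// Hypothetical helper: normalizes an uploaded filename for storage.
function seoFileName(string $originalName, ?int $articleId = null): string
{
    $name = strtolower(trim($originalName));
    $name = preg_replace('/\s+/', '_', $name);          // spaces => underscores
    $name = preg_replace('/[^a-z0-9._-]/', '', $name);  // drop everything else

    return $articleId === null ? $name : $articleId . '_' . $name;
}

// Example: seoFileName('Justin Beiber Found dead.jpg', 123)
// returns "123_justin_beiber_found_dead.jpg".
```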
For part 2:
I'm not sure what the best solution is here, but when using the filesystem, I think you will have to configure Apache so that all files in a particular directory are served by PHP. In PHP you can then check whether the file may be published and, if so, output it. If not, you can serve a dummy image. This, however, is not very efficient and will be much heavier on Apache.
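A minimal sketch of that check, assuming the image's database row has been loaded into an array with `path` and `published` keys (both key names, and the helper itself, are hypothetical):

```php
<?php
// Hypothetical helper: decides which file the PHP script should output.
function pathToServe(array $image, string $placeholderPath): string
{
    $isPublished = !empty($image['published']);
    $exists = isset($image['path']) && is_file($image['path']);

    // Unpublished or missing images fall back to the dummy image.
    return ($isPublished && $exists) ? $image['path'] : $placeholderPath;
}

// In the script that Apache routes image requests to (not executed here):
//   $path = pathToServe($row, '/path/to/dummy.png');
//   header('Content-Type: ' . mime_content_type($path));
//   readfile($path);
```

Storing the real images outside the web root means they cannot be reached by URL at all, so this script becomes the only way in.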