Best way to store large amount of data of users

I store each user's files in a directory named after the user, something like:
/username/file01.jpg
/username/file02.mp4
/username/file03.mp3
But as more users sign up and upload more files, this becomes a problem, because sooner or later some or many users will have to be migrated to another drive. I chose the per-username directory layout because I don't want file names from different users to be mixed, and I don't want to change the file names either. If files are stored under their original names in a shared location, two users uploading the same file name will collide.
What would be the best way to do this? I have one solution in mind, but I want to ask the community whether it is the best way.
I will use sequential folders, hash each file name to something unique, and store the file under that hash in the directory.
I will store the original file name, the username, and the hashed name used on disk in the database.
When anyone wants to access a file, I will read it through PHP and replace the name (or do something similar at that point) so that the file is downloaded under its original file name.
This is the only solution I have in mind. Do you have a better one?
Edit:
I use a folder system too, and for the second approach I will possibly use virtual folders.
My database is MongoDB.
All your answers were awesome and really helpful. I wanted to give the bounty to everyone, which is why I left it for the community to award automatically.
Thanks, all, for your answers. I really appreciate it.

Could you create relational MySQL tables? e.g.:
A users table and a files table.
Your users table would keep track of everything you are (I assume) already tracking:
id, name, email, etc.
Then the files table would store something like:
id, fileExtension, fileSize, userID <---- userID would be the foreign key pointing to the id field in the users table.
Then, when you save a file, you could save it as its id.fileExtension and use a query to pull the user associated with that file, or all files associated with a user.
e.g.:
SELECT users.name, files.id, files.fileExtension
FROM `users`
INNER JOIN `files` on users.id = files.userID;

I handle file metadata in the database and retrieve the files by UUID. What I do is:
1) Content-based identification:
   MD5 of the file's content.
   Namespaced UUID (v5) to generate a unique identifier based on the user's UUID and the file's MD5.
   Custom function to generate the path based on 'realname'.
   Save in the database: uuid, originalname (the uploaded name), realname (the generated name), filesize, and mime (optionally dateAdded and md5).
2) File retrieval:
   UUID to retrieve the metadata.
   Regenerate the file path based on realname.
   Originalname is used to show a familiar name to the user who downloads the file.
I process the file's name by assigning it a namespaced UUID, which serves as the database primary key, and generate the path based on the user and the file name. The precondition is that each user has a UUID assigned. The following code helps you avoid ID collisions in the database and lets you identify files by their contents (in case you ever need to spot duplicate content and not necessarily duplicate file names).
$fileInfo = pathinfo($_FILES['file']['name']);
$extension = isset($fileInfo['extension']) ? "." . $fileInfo['extension'] : "";
$md5Name = md5_file($_FILES['file']['tmp_name']); // you could use other hash algorithms if you are so inclined.
$realName = UUID::v5($user->uuid, $md5Name) . $extension; // UUID::v5(namespace, value).
I use a function to generate the file path based on some custom parameters; you could use $username and $realname. This is helpful if you implement a distributed folder structure, which you might have partitioned by file naming scheme or any custom scheme.
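function generateBasePath($realname, $customArgsArray){
    // Process the args as your requirements dictate.
    // The sub-path might as well be the first three characters of $realname,
    // or a checksum that helps you decide which drive/volume/mountpoint to use
    // (e.g. some files on the local disk and others on an Amazon S3 mountpoint).
    $mountpoint = $customArgsArray['mountpoint']; // assumed key, for illustration only
    $generatedPath = substr($realname, 0, 3);     // assumed scheme, for illustration only
    return $mountpoint.'/'.$generatedPath;
}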
As an added bonus, this also:
helps you maintain a versioned file repository, if you add an attribute to the file's record naming which file (uuid) it has replaced.
lets you create an application access control list, if you add 'owner' and/or 'group' attributes.
works with a single-folder structure.
Note: I used PHP's $_FILES as an example of the file source, based on this question's tags. The file can come from any source, or be generated content.
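UUID::v5 in the snippet above is not a PHP built-in, so it has to come from a library or your own class. As a minimal sketch, assuming the RFC 4122 version-5 scheme (SHA-1 over the namespace bytes followed by the name), a stand-alone function could look like this. The function name uuidV5 and the example namespace value are illustrative only:

```php
<?php
// Sketch of a version-5 (SHA-1, namespaced) UUID as described in RFC 4122.
// uuidV5() is a hypothetical stand-in for the UUID::v5() call above.
function uuidV5(string $namespace, string $name): string
{
    // The namespace must itself be a UUID; convert it to its 16 raw bytes.
    $nsHex = str_replace('-', '', $namespace);
    if (!preg_match('/^[0-9a-fA-F]{32}$/', $nsHex)) {
        throw new InvalidArgumentException('Namespace must be a UUID.');
    }
    $hash = sha1(hex2bin($nsHex) . $name); // 40 hex characters

    return sprintf(
        '%s-%s-%04x-%04x-%s',
        substr($hash, 0, 8),
        substr($hash, 8, 4),
        (hexdec(substr($hash, 12, 4)) & 0x0fff) | 0x5000, // set version 5
        (hexdec(substr($hash, 16, 4)) & 0x3fff) | 0x8000, // set RFC 4122 variant
        substr($hash, 20, 12)
    );
}

// Example: the user's UUID is the namespace, the file's MD5 is the name.
$userUuid = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'; // placeholder value
$realName = uuidV5($userUuid, md5('file contents')) . '.jpg';
```

Because the result is deterministic, the same user uploading the same content twice gets the same realname, which is what makes duplicate detection possible.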

Since you already use MongoDB, I would suggest checking out GridFS. It's a specification that allows you to store files in MongoDB collections, even if they are larger than the 16 MB document size limit.
It scales, so you'll have no problems if you add another server. It also stores metadata, lets you read files in chunks, and has built-in backup functions.

Another tactic is to create a 2-dimensional structure where the first level of directories are the first 2 characters of the username, then the second level is the remaining characters (similar to how Git stores its SHA-1 object IDs). For example:
/files/jr/andomuser/456.jpg
for user 'jrandomuser'.
Please note that as usernames will likely not be distributed as randomly as SHA-1 values, you may need to add another level later on. Doubt it, though.
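A minimal sketch of that two-level layout, assuming a base directory of /files and a hypothetical helper name. How to treat usernames of two characters or fewer is not covered above, so the "_" fallback is purely an assumption:

```php
<?php
// Split a username into a 2-character first level and the remainder,
// similar to how Git shards its object directory.
function userDirectory(string $baseDir, string $username): string
{
    $prefix = substr($username, 0, 2);
    $rest   = strlen($username) > 2 ? substr($username, 2) : '_'; // assumed fallback
    return rtrim($baseDir, '/') . '/' . $prefix . '/' . $rest;
}

echo userDirectory('/files', 'jrandomuser') . '/456.jpg', PHP_EOL;
```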

I would generate a GUID for the file name from a hash of the original file name, the date and time of the upload, and the username, and save those values, as well as the path to the file, in a database for later use. With such a GUID, the file names cannot easily be guessed.
As an example, say user Daniel Steiner (me) uploads a file called resume.doc to your server on 23 April 2013 at 00:37. This would give a base value of
Daniel_Steiner+2013/23/04+00:37+resume.doc, which as an MD5 hash would be 05c2d2f501e738b930885d991d136f1e. To ensure that the file is opened in the right program, we then append the original file extension and get something like http://link.to/your/site/05c2d2f501e738b930885d991d136f1e.doc. If your user accounts already have a user ID, you could add it to the URL; for example, if my user ID were 123145, the URL would be http://link.to/your/site/123145/05c2d2f501e738b930885d991d136f1e.doc
If you save the original file name in the database, you can later also offer a download script that serves the file under its original name, even though it has another file name on your server.
If you can use symbolic links, relocating the files to another hard disk shouldn't be a problem either.
If you want, I could come up with a PHP example as well; it shouldn't be too much code.
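As a rough sketch of the naming step described above, assuming the base string is built exactly as in the example (username, date as Y/d/m, time as H:i, original file name, joined with "+"). The function name is hypothetical:

```php
<?php
// Build the stored file name from username, upload time and original name.
function generatedFileName(string $username, string $originalName, DateTimeInterface $uploadedAt): string
{
    $base = $username . '+' . $uploadedAt->format('Y/d/m+H:i') . '+' . $originalName;
    $ext  = pathinfo($originalName, PATHINFO_EXTENSION);
    // Keep the original extension so the file opens in the right program.
    return md5($base) . ($ext !== '' ? '.' . $ext : '');
}

$stored = generatedFileName('Daniel_Steiner', 'resume.doc', new DateTimeImmutable('2013-04-23 00:37'));
echo $stored, PHP_EOL;
```

Since every input here is something an outsider might know or guess, mixing in a random value (and storing the resulting name in the database) would make the names harder to predict.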

Since a filesystem is a tree, not a graph (faceted classification), it's hard to come up with a way for it to easily represent multiple entities, like users, media types, dates, events, image crop types, etc. That's why using a relational database is easier: it can be converted to a graph.
But since it's another level of abstraction, you need to write the functions that do the low-level synchronization yourself, including avoiding name collisions, long path names, and large file counts per folder, and handling ease of transfer per entity, horizontal scaling, etc. So it depends on how complex your application needs to be.

I suggest using the following database structure,
where the File table has at least:
IDFile, an auto_increment column and the primary key.
UserID, a nullable foreign key.
For FK_File_User I suggest:
ON UPDATE NO ACTION -- IDUser is auto_increment too. No changes need to be tracked.
ON DELETE SET NULL -- If the user is deleted, the File is no longer owned. It might be deleted
-- by a cron job or something else.
Other columns might also be added to the File table:
Actual upload date and time
Actual mime-type
Actual storage place (for distributed storage systems)
Download count (another table might be a better solution)
etc...
Some benefits:
You don't need to calculate the file size, hash, extension or any other file metadata, because you can obtain it with one database operation.
You can obtain per-user statistics (file count, space used, or whatever else you wrote to the File table) with a single SELECT ... GROUP BY ... WITH ROLLUP statement, and it will be faster than analysing the actual files, which may be spread across multiple storage devices.
You can apply file access permissions for different users. This requires only a minor change to the table structure.
I don't consider keeping the original file names in storage an option, for two reasons:
A file may have a name that the server OS filesystem does not support correctly, such as a Cyrillic one.
Two different files may have identical names, so one of them might be overwritten by the other.
So, here is a solution:
1) When a file is uploaded, rename it to the IDFile obtained from the INSERT into the File table. It's safe and there are no duplicates.
2) Restore the name of the file when it's needed / downloaded, like:
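// perform a query on the "File" table by the given ID
list($name, $ext, $size, $md5) = $result->fetch_row();
$result->free();
// strip characters that could break out of the quoted filename or inject headers
$downloadName = str_replace(array('"', "\r", "\n"), '', $name . '.' . $ext);
header('Content-Length: ' . $size);
// Content-MD5 carries the base64-encoded binary digest; this assumes $md5 is stored as a hex string
header('Content-MD5: ' . base64_encode(pack('H*', $md5)));
header('Accept-Ranges: bytes');
header('Connection: close');
// application/octet-stream is the registered generic binary type
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $downloadName . '"');
// flush file content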
3) The actual files may be stored in a single directory (because IDFile is safe) or in IDUser-named subdirectories, depending on the situation.
4) As IDFile is a plain sequence, if some files go missing you can find their database metadata by looking for gaps in the sequence of actual file names. Then you may inform the owners, delete the file metadata, or both.
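The gap check in step 4 can be sketched in a few lines. This assumes the files on disk are named by their IDFile (optionally followed by an extension) and that IDs start at 1; the function name is hypothetical:

```php
<?php
// Return the IDs in 1..$maxId that have no matching file name on disk.
function findMissingFileIds(array $storedNames, int $maxId): array
{
    // intval() keeps the leading digits, so "4" and "4.jpg" both map to 4.
    $present = array_map('intval', $storedNames);
    return array_values(array_diff(range(1, $maxId), $present));
}

// $storedNames would normally come from scandir() on the storage directory,
// and $maxId from SELECT MAX(IDFile) on the File table.
$missing = findMissingFileIds(['1', '2', '4', '7'], 7);
```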
I'm against the idea of storing large files in the DBMS itself as binary content.
A DBMS is about data and analysis; it's not a filesystem, and should never be used that way, if my humble opinion matters.

You can install an LDAP server. LDAP lookups are very fast, since LDAP is highly optimized for read-heavy workloads. You can even query for data.
LDAP organizes the data in a tree-like fashion.
You can organize the data as in the following example: "user->IP address->folder->file name". This way the files can be physically/geographically spread out and you can still fetch the location very quickly.
You can also run standard LDAP queries, e.g. to get the list of all files for a particular user, or the list of files in a folder.

1. Use MongoDB to store the actual file name (e.g. myImage.jpg) and other attributes (e.g. MIME type), plus the $random-text.jpg name from 2. and 3. below.
2. Generate some $random-text, e.g. base_convert(mt_rand(), 10, 36) or uniqid($username, true);
3. Physically store the file as $random-text.jpg. It's always good to keep the same extension.
NOTE: Use filter_var() to ensure the input file name doesn't pose a security risk to MongoDB.
Amazon S3 is reliable and cheap; be aware of "eventual consistency" with S3.
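A minimal sketch of the name generation step. Note that it swaps in random_bytes() (PHP 7+) for the mt_rand()/uniqid() calls mentioned above, since those are predictable; the function name and the choice of 16 random bytes are assumptions:

```php
<?php
// Produce a random on-disk name that keeps the original extension.
function randomStoredName(string $originalName): string
{
    $ext    = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    $random = bin2hex(random_bytes(16)); // 32 hex characters
    return $ext === '' ? $random : $random . '.' . $ext;
}

$stored = randomStoredName('myImage.jpg');
// Save $stored next to the original name 'myImage.jpg' in the metadata record.
```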

Assuming users have a unique ID (Primary Key) in the database, if a user with ID 73 uploads a file, save it like this:
"uploads/{$userid}_{$filename}.{$ext}"
(The braces are needed: without them PHP would read $userid_ as the variable name.)
For example, 73_resume.doc, 73_myphoto.jpg
Now, when fetching files, use this code:
foreach (glob("uploads/{$userid}_*.*") as $filename) {
    echo $filename;
}
This can be combined with the hashing solutions (with the hash stored in the DB), so that a user who gets a download path like 73_photo.jpg cannot simply try 74_photo.jpg in the browser address bar.

Related

Storing image/data in MySQL and naming conventions

What are some ideas for storing images on web servers? I'm using PHP and MySQL for the application.
Question 1
Do we change the name of the physical file to something like a000000001.jpg and store it in a base directory, or keep the user's unmanaged file name, i.e. 'Justin Beiber Found dead.jpg'? For example
wwroot/imgdir/a0000001.jpg
and all metadata in a database, such as FileName and ReadableName, Size, Location, etc.
I need to make a custom file manager and am just weighing the pros and cons of the underlying structure for storing the images.
Question 2
How would I prevent an image from being downloaded if my app/database has not set it to be published/public?
In my app I can publish images or protect them from download. If I stored the image in a DB table, I could store it as a BLOB and use PHP to prevent the user from downloading it. I want to be able to do the same with an image stored in the filesystem, but I'm not sure whether this is possible with PHP and files on disk.

Keeping relevant file names can be good for SEO, but you must also make sure you don't create duplicates.
In all cases I would rename files to lowercase and replace spaces with underscores (or hyphens):
Justin Beiber Found dead.jpg => justin_beiber_found_dead.jpg
If the photo belongs to an article or something specific, you can perhaps add the article ID to the image name, i.e. 123_justin_beiber_found_dead.jpg. Alternatively, you can store the images in an article-specific folder, i.e. /images/123/justin_beiber_found_dead.jpg.
Naming the files like a0000001 removes all relevance from the files and adds no value whatsoever.
Store only the (full) file paths in the database.
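The renaming rule above (lowercase, spaces to underscores, optional article ID prefix) can be sketched like this. Both function names are hypothetical, and a real implementation would likely also strip other unsafe characters:

```php
<?php
// Lowercase the name and replace spaces with underscores.
function normalizeFileName(string $name): string
{
    return str_replace(' ', '_', strtolower($name));
}

// Prefix the normalized name with the ID of the article it belongs to.
function articleImageName(int $articleId, string $name): string
{
    return $articleId . '_' . normalizeFileName($name);
}

echo articleImageName(123, 'Justin Beiber Found dead.jpg'), PHP_EOL;
```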
For part 2:
I'm not sure what the best solution is here, but when using the filesystem, I think you will have to configure Apache to serve all files in a particular directory through PHP. In PHP you can then check whether the file may be published and, if so, output it. If not, you can serve a dummy image. This is not very efficient, however, and will put a much heavier load on Apache.

Standard for uploads into a server directory?

I have a question about the upload file-naming standards, if any, that you use. Imagine you have an application that allows many types of documents to be uploaded to your server and placed into a directory. Perhaps the same document could even be uploaded twice. Usually, you have to make some kind of unique file name adjustment when saving the document. Assume it is saved in a directory, not directly in a database. Of course, the metadata would probably need to be saved in the database. The typical PHP upload methods could be used for the application; simple enough to do.
Possible Filenaming Standard:
1.) Append the document filename with a unique id: image.png changed to image_20110924_ahd74vdjd3.png
2.) Perhaps use a UUID/GUID and store the actual file type (meta) in a database: 2dea72e0-a341-11e0-bdc3-721d3cd780fb
3.) Perhaps a combination: image_2dea72e0-a341-11e0-bdc3-721d3cd780fb.png
Can you recommend a good standard approach?
Thanks, Jeff

I always just hash the file's contents using md5_file() or sha1_file() and use that as the file name.
E.g.
3059e384f1edbacc3a66e35d8a4b88e5.ext
And I would save the original file name in the database in case I ever need it.
This makes the file name unique AND makes sure you don't have the same file multiple times on your server (since identical files have the same hash).
EDIT
As you can see, I had some discussion with zerkms about my solution, and he raised some valid points.
I would always serve the file through PHP instead of letting users download it directly.
This has some advantages:
I would add a record to the database whenever a user uploads a file. It would contain the user who uploaded the file, the original file name and the hash of the file.
If a user wants to delete a file, you just delete that user's record for the file.
If no user has the file any more after the delete, you can delete the file itself (or keep it anyway).
You should not keep the files inside the document root, but rather somewhere that isn't publicly accessible, and serve them to the user using PHP.
A disadvantage, as zerkms has pointed out, is that serving files through PHP consumes more resources, although I find the advantages to be worth the extra resources.
Another thing zerkms has pointed out is that the extension isn't really needed when saving the file under its hash (since it is already in the database), but I always like to know what kinds of files are in the directory by simply doing an ls -la, for example. Again, though, it isn't really necessary.
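A minimal sketch of the content-hash storage described in this answer, assuming SHA-1 and a storage directory outside the document root. The function name is hypothetical, and copy() is used so the sketch runs on any file; for a real upload you would use move_uploaded_file() on the tmp_name instead:

```php
<?php
// Store a file under the SHA-1 of its contents and return that hash.
// Identical content is written to disk only once.
function storeByContentHash(string $sourcePath, string $storageDir): string
{
    $hash = sha1_file($sourcePath);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $sourcePath");
    }
    $target = rtrim($storageDir, '/') . '/' . $hash;
    if (!file_exists($target) && !copy($sourcePath, $target)) {
        throw new RuntimeException("Cannot write $target");
    }
    return $hash;
}
```

The returned hash is what goes into the per-user database record together with the original file name; the physical file is deleted only once no record references its hash any more.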

PHP file rename

I have a form where an admin uploads three pictures with different dimensions to three different designated directories. To avoid the problem of duplicate file names, I implemented a check: PHP compares the uploaded file name against the designated directory, and if a file with that name already exists, it echoes an error and stops the script.
Now a friend of mine pointed out that it is very bad to ask the admin to rename the picture file manually and to take care of the duplication problem. The solution he suggested was to rename the file automatically, store the name in the database, and then move the file to the directory.
I am confused about what naming scheme I should use for the renamed file so that the file name remains unique. To be more precise, let me explain my directory structure.
As I said, the admin will upload three picture files, namely
a) Title Picture b) Brief Picture c) Detail Picture
and each of the three picture files will be moved to its respective directory: the title picture goes to the title directory, and so on.
I currently use the script below just to move the file and store the file name with its path in the database as a varchar.
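$ns_pic_title_loc = $_FILES["ns_pic_title"]["tmp_name"];
$ns_pic_title_name = $_FILES["ns_pic_title"]["name"];
// mysql_error() reports nothing about a failed move, so die with a plain message instead
move_uploaded_file($ns_pic_title_loc, $ns_title_target.$ns_pic_title_name) or die('Could not move the uploaded file.');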
That is just sample code; I haven't included the validation function I am using. I was thinking that I want to rename all the files like this:
a) In the title directory the files should be stored as:
title_1.jpg
title_2.jpg
title_3.jpg
title_4.jpg
and so on
and the same for the rest of the pictures. How do I do that? What function do I use to achieve this? And if this is not a good way to rename the files, I would appreciate any suggestions for a better one.
Thanks in advance.

Well, here's a possible solution:
Get the uploaded file name from $_FILES["ns_pic_title"]["name"] and separate the extension, OR, if we are only talking about image files, get the image type with getimagesize($_FILES["ns_pic_title"]["tmp_name"]);
Check your database for the maximum id of the image records and set the $file_name variable to 'title_'.($max_id + 1)
At this point you should have $file_name and $file_extension, so call move_uploaded_file($_FILES["ns_pic_title"]["tmp_name"], $ns_title_target.$file_name.'.'.$file_extension)
Hopefully this makes sense and helps.

There are a couple of good options with various pros and cons.
Use PHP's tempnam when moving the file, and store the path in your MySQL database. tempnam generates a unique file name.
Use MySQL to store the image content in a BLOB. This way you will access the image content via an id instead of a path name.

Instead of having logic to figure out what the latest picture name is and calculate the next number, why not just use PHP's tempnam() function? It generates a unique name with a prefix of your choice (i.e., "title", "brief", "detail"). You could also simply prepend a timestamp to the file name; if you don't have a whole lot of admins uploading pictures at the same time, that should handle most name conflicts.
Since your pictures are going to be sorted into title, brief and detail directories already, it's not really necessary to name each picture title_*, brief_*, and detail_*, right? If it's in the title directory, then it's obviously a title picture.
Also, you're going to be putting the file names in the database. Then elsewhere in the app, when you want to display a picture, I assume you are getting the correct file name from the database. So it isn't really important what the actual file name is as long as the application knows where to find it. If that's correct, it's not necessary to have a very friendly name, thus a tempnam() file name or a timestamp plus the original file name would be acceptable.
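A minimal sketch of the tempnam() approach, assuming the target directory already exists and is writable. The wrapper name is hypothetical; tempnam() itself creates an empty file with the unique name, which the uploaded file then replaces:

```php
<?php
// Reserve a unique path inside $targetDir using tempnam().
function uniqueUploadPath(string $targetDir, string $prefix): string
{
    $path = tempnam($targetDir, $prefix);
    if ($path === false) {
        throw new RuntimeException("Could not create a unique file in $targetDir");
    }
    return $path; // note: the generated name has no file extension
}

// In an upload handler this would be followed by something like:
//   move_uploaded_file($_FILES['ns_pic_title']['tmp_name'], $path);
// and $path would then be stored in the database.
```

Be aware that on Windows only the first three characters of the prefix are used.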

Because you are storing references in the DB, I would prefer to just md5 the datetime, use that for the file name, and store the on-disk file name in the DB as well. It doesn't matter what name the file is written to disk with, as long as the unique name in the DB points to it.
I use this methodology, and in none of my testing has the disk name (the md5 of the datetime) ever required multiple tries.

how to organize files created dynamically using php?

I have a PHP website which creates and stores HTML template files on the server based on user input. One user can create many templates. To store the template files and associate them with the DB record, what I do is this:
"templates" is the table which holds other information about the template, such as who created it, with a unique auto-increment id as template_id.
For example:
if the template id is 1001,
I convert it to hex, which is 03e9.
Now I split the hex number after two digits into 03 and e9: 03 becomes the folder and e9 becomes
the file, with some extension, as "e9.tpl".
This is how I can find the template in the file system if I know the template ID. I don't need to store the path to the file separately.
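A minimal sketch of this mapping, assuming the ID is zero-padded to at least four hex digits (the function name is hypothetical):

```php
<?php
// Map a template ID to "<first two hex digits>/<remaining hex digits>.tpl".
function hexTemplatePath(int $templateId): string
{
    $hex = sprintf('%04x', $templateId); // 1001 -> "03e9"
    return substr($hex, 0, 2) . '/' . substr($hex, 2) . '.tpl';
}

echo hexTemplatePath(1001), PHP_EOL;  // 03/e9.tpl
echo hexTemplatePath(65536), PHP_EOL; // 10/000.tpl
```

Note what happens from ID 65536 (0x10000) onward: the hex string grows to five digits, so the file part becomes three characters long.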
Is it a good approach? Are there any shortfalls? Is there a better approach?
What are the advantages / disadvantages of storing the path to the file in the database itself, for example to enable serving templates from different disks?

If the ID in the DB table is already UNIQUE, why transform the id for the filesystem at all?
Just add a file 1001.tpl and you are all set. If you want to have template files sorted into folders, use the User ID (which I assume to be UNIQUE too), so you get folder 124/1001.tpl.
Depending on your deployment process, you will want to keep the created files outside the application folder, so you do not accidentally delete them when updating the application.

Are you doing this because you are worried that you might run out of file entries/inodes in the directory? In ext3 the practical limit is somewhere around 100,000 files (and 32,000 dirs).
Creating a directory structure on the fly is better done using modulo, as in $dir = $id % 1000, and then putting the new template in that dir ($dir/$id.tpl). That strategy will create at most 1000 dirs, which makes it possible to handle around 100,000,000 files.
I don't see any reason for messing with hexadecimal values or substrings.

If you have to hit the database to get the id, you may be just as well off storing the template in it as well. But there's nothing categorically wrong with storing them on the file system. I generally would.
When you hit 65,536, you'll get 0x10000. Make sure your code can handle that. I'd be more apt to store 0x1234 like: 1/1234.tpl, just for the sake of clarity. Note that by virtue of sequential IDs, your folders will fill up sequentially.
I'd probably not even convert them to hex. You could use a modulus operator to determine which folder to put them in. Figure out how many files you are likely to have and use that to determine how many folders you want.
For example:
$path = ($id % NUMBER_OF_FOLDERS) . "/$id.tpl";
where $id is the template id in decimal.
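Wrapped into a function, and assuming 1000 folders purely for illustration (the constant's value and the function name are not from the answer), this could look like:

```php
<?php
const NUMBER_OF_FOLDERS = 1000; // assumed value; size it to your expected file count

// Spread template files over a fixed number of folders by ID modulo.
function moduloTemplatePath(int $id): string
{
    return ($id % NUMBER_OF_FOLDERS) . "/$id.tpl";
}

echo moduloTemplatePath(1001), PHP_EOL; // 1/1001.tpl
```

Because the file name still contains the full decimal ID, no lookup table is needed to find a template again, and sequential IDs are spread evenly across the folders.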

I don't understand the point of splitting the hex into two parts to create different folders... That could create hundreds and hundreds of different folders, which would become a complete mess on your server. Why not just store the templates in one single folder with the hex value as the file name, such as 03e9.tpl?

PHP: Storing file locations...what if overwritten?

I am currently using the Zend Framework and have an upload file form. An authenticated user has the ability to upload a file, which will be stored in a directory in the application, and the location stored in the database. That way it can be displayed as a file that can be downloaded.
Download
But something I am noticing is that a file with the same name will overwrite a file in the uploads directory. There is no error message, nor does the filename increment. So I think the file must be overwritten (or never uploaded).
What are some best practices I should be aware of when uploading, moving, or storing these files? Should I always be renaming the files so that the filename is always unique?

Generally, we don't store files under the name given by the user, but under a name that we (i.e. our application) choose.
For instance, if a user uploads my_file.pdf, we would:
store a row in the DB, containing:
id: an autoincrement, the primary key ("123", for instance)
the name given by the user, so we can send the right name when someone tries to download the file
the content type of the file: application/pdf or something like that, for instance
"our" name: file-123, for instance
when there is a request for the file with id=123, we know which physical file should be fetched ('file-' . $id) and sent.
and we can set a header to send the correct "logical" name to the browser, using the name we stored in the DB, for the "save as" dialog box
same for the content type, btw
This way, we make sure:
that no file has a "wrong" name, as we are the ones choosing it, not the client
that nothing is overwritten: as our file names include the primary key of our table, they are unique
Continuing on Pascal MARTIN's answer:
If you use an id as the name, you can also come up with a directory naming strategy. It takes no longer to get /somedir/part1ofID/part2OfID from the filesystem than /somedir/theWholeID, but it lets you choose how many files are stored in the same directory by how you split the ID to form the path and file name.
The next good thing is that the script that you use to actually output the file to the user can choose if the user is authorized to see the file or not. This of course requires the files to be stored somewhere not readable by everyone by default.
You may also want to look at this other question. Not totally related, but good to be aware of.

Yes, you need to come up with a way to name them uniquely. I've seen all kinds of strategies for this, ranging from a hash based on the original file name, the PK of the DB record and the upload timestamp, to some type of slugging, again based on various fields of the DB record it's attached to or of related records.
