I'm trying to allow users to upload files through a PHP website. Since all the files are saved in a single folder on the server, it's conceivable (though admittedly with low probability) that two distinct users could upload two files that, while different, are named exactly the same. Or perhaps they're exactly the same file.
In both cases, I'd like to use exec("openssl md5 " . $file['upload']['tmp_name']) to determine the MD5 hash of the file immediately after it is uploaded. Then I'll check the database for any identical MD5 hash and, if found, I simply won't complete the upload.
However, in the move_uploaded_file documentation, I found this comment:
Warning: If you save a md5_file hash in a database to keep record of uploaded files, which is usefull to prevent users from uploading the same file twice, be aware that after using move_uploaded_file the md5_file hash changes! And you are unable to find the corresponding hash and delete it in the database, when a file is deleted.
Is this really the case? Does the MD5 hash of a file in the tmp directory change after moving it to a permanent location? I don't understand why it would. And regardless, is there another, better way of ensuring the same file is not uploaded to the filesystem multiple times?
If you're convinced by all the reasons given here in the answers and decide not to use md5 at all (I'm still not sure whether you WANT to or MUST use hash), you can just append something unique for each user and the time of uploading to each file name. That way you'll end up with more readable file names. Something like: $filename = "$filename-$user_ip_string-$microtime";. Of course, you must have all three variables ready and formatted before that, it goes without saying.
No chance of the same file name, same IP address and same microtime occurring at the same time, right? You could easily get away with microtime only, but the IP will make it even more certain.
Of course, like I said, all this goes if you decide not to use hashing and go for a simpler solution.
Shouldn't you use exec("openssl md5 " . $file['upload']['name']) name instead? I'm thinking that the temporary name differs from upload to upload.
It would seem that it is indeed the case; I have briefly looked through the docs as well. But why don't you compute the MD5 checksum before using move_uploaded_file and store that value in your database, linking it directly to the new file? That way you can always check an uploaded file against the database to see whether it already exists in your filesystem.
This does require a database, but most have access to one.
Try renaming the uploaded file to a unique id.
Use this:
$dest_filename = $filename;
if (RENAME_FILE) {
$dest_filename = md5(uniqid(rand(), true)) . '.' . $file_ext;
}
Let me know if it helps :)
No, in general the hash doesn't change by move_uploaded_file somehow magically.
But if you compute the md5() including the file's path, the hash will certainly change if the file is moved to a new path/folder.
In case you md5() the filename, nothing will change.
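To make the difference concrete, here is a minimal sketch. It assumes a temporary file standing in for the upload and uses rename() in place of move_uploaded_file(), which only accepts real uploads. md5_file() hashes the file's bytes, while md5() on the path only hashes the path string.

```php
<?php
// Hypothetical demo: a temporary file stands in for an uploaded file.
$tmpPath = tempnam(sys_get_temp_dir(), 'upl');
file_put_contents($tmpPath, "example upload contents\n");

$hashBefore = md5_file($tmpPath);   // hash of the file's bytes
$pathHashBefore = md5($tmpPath);    // hash of the path string only

// rename() stands in for move_uploaded_file() in this sketch.
$newPath = $tmpPath . '.moved';
rename($tmpPath, $newPath);

$hashAfter = md5_file($newPath);
$pathHashAfter = md5($newPath);

var_dump($hashBefore === $hashAfter);         // content hash: unchanged by the move
var_dump($pathHashBefore === $pathHashAfter); // path hash: differs after the move

unlink($newPath);
```

So a hash stored in the database stays valid as long as it was taken from the contents and not from the temporary path.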
It's a good idea to rename uploaded files with a unique name.
But don't forget that the location where you finally store the file should be outside the document root of your vHost. Located there, it can't be downloaded without going through a PHP script.
Final remark: while it's very, very unlikely, the MD5 hashes of two different files may be identical.
Related
I'm writing a file upload site, and am interested in saving space. If a user uploads a file, I want to ensure this file has not already been uploaded before (if it has been, I will just point to the existing file in the database).
I was considering using sha1_file() on the file, checking the database to see if the digest exists in a database of digests. Then I remembered the pigeonhole principle, and decided to check the undigested files against each other if there is a sha1 digest match.
This seems inefficient to me. I figure I could just check the first kilobyte of each file against each other in the event of a check sum match.
I haven't thought too much about the value of RAM versus ROM, and it might be possible that the processing power required to check the files costs more than the storage space I would save.
Are there any shortcomings to this method? Am I wasting my time in even bothering with this?
You could use md5( file_data ) to generate the names of the files, and then it will never be possible to store the same file under a different name. The only problem is that it is technically possible for two different files to generate the same MD5, but it's unlikely, especially if the two files have the same extension, so you could consider this a non-problem. Under this scheme there is no reason to even check: if two hashes are the same, the upload simply overwrites the stored file. This is how most file storage engines work internally, such as zimg. If you are paranoid about collisions, you could first see whether a file with the computed hash and extension already exists and, if it does, compare the data of that stored file with the data of the file you are attempting to store. If the data is not equal, you could have it email you an alert.
$data = file_get_contents('flowers.jpg');
$name = md5($data).'.jpg';
$fh = fopen($name,'w+');
fwrite($fh,$data);
fclose($fh);
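For the paranoid variant described above, a rough sketch might look like the following. storeByHash() and its parameters are made-up names, and an exception stands in for the email alert.

```php
<?php
/**
 * Store $data under a name derived from its MD5 hash.
 * Returns the path; throws if a different file already has the same hash.
 * storeByHash() is a hypothetical helper for this sketch.
 */
function storeByHash(string $data, string $dir, string $ext): string
{
    $path = $dir . '/' . md5($data) . '.' . $ext;

    if (file_exists($path)) {
        // Same hash already stored: compare the bytes to rule out a collision.
        if (file_get_contents($path) !== $data) {
            throw new RuntimeException("Hash collision detected for $path");
        }
        return $path; // identical file, nothing to write
    }

    if (file_put_contents($path, $data) === false) {
        throw new RuntimeException("Could not write $path");
    }
    return $path;
}

// Usage: storing the same bytes twice yields a single file.
$dir = sys_get_temp_dir() . '/store_' . uniqid();
mkdir($dir);
$first = storeByHash('flower bytes', $dir, 'jpg');
$second = storeByHash('flower bytes', $dir, 'jpg');
```

Both calls return the same path, and only one file ends up in the directory.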
I have a topic/question concerning your upload filename standards, if any, that you are using. Imagine you have an application that allows many types of documents to be uploaded to your server and placed into a directory. Perhaps the same document could even be uploaded twice. Usually, you have to make some kind of unique filename adjustment when saving the document. Assume it is saved in a directory, not saved directly into a database. Of course, the Meta Data would probably need to be saved into the database. Perhaps the typical PHP upload methods could be the application used; simple enough to do.
Possible Filenaming Standard:
1.) Append the document filename with a unique id: image.png changed to image_20110924_ahd74vdjd3.png
2.) Perhaps use a UUID/GUID and store the actual file type (meta) in a database: 2dea72e0-a341-11e0-bdc3-721d3cd780fb
3.) Perhaps a combination: image_2dea72e0-a341-11e0-bdc3-721d3cd780fb.png
Can you recommend a good standard approach?
Thanks, Jeff
I always just hash the file using md5() or sha1() and use that as a filename.
E.g.
3059e384f1edbacc3a66e35d8a4b88e5.ext
And I would save the original filename in the database in case I ever need it.
This will make the filename unique AND it makes sure you don't have the same file multiple times on your server (since they would have the same hash).
EDIT
As you can see I had some discussion with zerkms about my solution and he raised some valid points.
I would always serve the file through PHP instead of letting user download them directly.
This has some advantages:
I would add a record to the database whenever a user uploads a file. It would contain the user who uploaded the file, the original filename and the hash of the file.
If a user wants to delete a file, you just delete that user's record for the file.
If no more users have the file after the delete, you can delete the file itself (or keep it anyway).
You should not keep the files somewhere in the document root, but rather somewhere else where it isn't accessible by the public and serve the file using PHP to the user.
A disadvantage as zerkms has pointed out is that serving files through PHP is more resource consuming, although I find the advantages to be worth the extra resources.
Another thing zerkms has pointed out is that the extension isn't really needed when saving the file under its hash (since it is already in the database), but I always like to know what kind of files are in the directory by simply doing a ls -la, for example. Again, though, it isn't really necessary.
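The record handling described above could be sketched roughly like this. An in-memory array stands in for the database table, and removeRecord() and the field names are invented for the illustration.

```php
<?php
// In-memory stand-in for the database table of upload records (an assumption).
$records = [
    ['user' => 'alice', 'name' => 'cat.png',    'hash' => 'abc123'],
    ['user' => 'bob',   'name' => 'kitten.png', 'hash' => 'abc123'],
];

/**
 * Remove one user's record for a hash and report whether the stored
 * file is now unreferenced and may be deleted from disk.
 */
function removeRecord(array &$records, string $user, string $hash): bool
{
    $records = array_values(array_filter(
        $records,
        fn(array $r): bool => !($r['user'] === $user && $r['hash'] === $hash)
    ));

    foreach ($records as $r) {
        if ($r['hash'] === $hash) {
            return false; // another user still references the file
        }
    }
    return true;
}

$afterAlice = removeRecord($records, 'alice', 'abc123'); // bob still has the file
$afterBob   = removeRecord($records, 'bob', 'abc123');   // no references left
```

Only when the function returns true would you unlink the stored file.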
I would like to create an upload form with PHP. The problem is that it will be used to upload a fixed row length text file that contains orders (full order details would be duplicated for each row).
Then it should place the file somewhere and call a program that will read the file and place the orders. The problem is that I want to prevent the same order file from being sent to the order program twice.
The file won't have any unique identifier. I am wondering what the best way is to check that the file isn't the same. One solution is to calculate the MD5 for each file and store the hashes, but I am not sure about concurrency, whether this would work, and how many files to compare against.
The only solution I can figure out is to store a limited number of hashes (say, at most 20) in a file and use flock() on this file to avoid concurrency problems. For example, program A checks via MD5 whether the file exists and program B does the same; both may read a not-yet-updated list, which is why I think I should use an exclusive lock.
Any other solution ?
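For what it's worth, the flock() idea from the question could be sketched like this. registerHash() is a made-up helper name, and the cap of 20 entries is left out for brevity.

```php
<?php
/**
 * Record $hash in $listFile unless it is already there.
 * Returns true if the hash was new, false if it is a duplicate.
 * registerHash() is a hypothetical helper for this sketch.
 */
function registerHash(string $listFile, string $hash): bool
{
    $fh = fopen($listFile, 'c+'); // create if missing, do not truncate
    if ($fh === false) {
        throw new RuntimeException("Cannot open $listFile");
    }
    try {
        if (!flock($fh, LOCK_EX)) { // blocks until other lock holders finish
            throw new RuntimeException("Cannot lock $listFile");
        }
        // Read the known hashes while holding the exclusive lock.
        $known = [];
        while (($line = fgets($fh)) !== false) {
            $known[] = trim($line);
        }
        if (in_array($hash, $known, true)) {
            return false;
        }
        fseek($fh, 0, SEEK_END);
        fwrite($fh, $hash . "\n");
        fflush($fh);
        return true;
    } finally {
        flock($fh, LOCK_UN);
        fclose($fh);
    }
}

$list = tempnam(sys_get_temp_dir(), 'md5');
$firstTime = registerHash($list, md5('order file contents'));
$secondTime = registerHash($list, md5('order file contents'));
unlink($list);
```

Because the check and the append happen under the same exclusive lock, two processes cannot both conclude that the same hash is new.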
Store the MD5 hash (or SHA1) and size of the file in the database. Index the hash.
To check for duplicates, just search in the database for a file with the same hash and size.
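A minimal sketch of that lookup, assuming a plain array in place of the indexed database table; fileKey() and recordIfNew() are hypothetical helper names.

```php
<?php
// $index stands in for an indexed database table keyed on (hash, size).
$index = [];

/** Build the lookup key for a file: its SHA-1 hash plus its size in bytes. */
function fileKey(string $path): string
{
    clearstatcache(true, $path); // make sure filesize() is not served from cache
    return sha1_file($path) . ':' . filesize($path);
}

/** Returns true if the file is new and was recorded, false if it is a duplicate. */
function recordIfNew(array &$index, string $path): bool
{
    $key = fileKey($path);
    if (isset($index[$key])) {
        return false;
    }
    $index[$key] = $path;
    return true;
}

$a = tempnam(sys_get_temp_dir(), 'ord');
$b = tempnam(sys_get_temp_dir(), 'ord');
file_put_contents($a, "order 1\n");
file_put_contents($b, "order 1\n"); // same contents, different name

$firstIsNew = recordIfNew($index, $a);
$secondIsNew = recordIfNew($index, $b);
unlink($a);
unlink($b);
```

With a real database, a unique index on the hash and size columns would also cover the concurrency concern, since the second insert would fail.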
To store uploaded files by users on remote server inside disk folder I change the name of file to
$filename = '/tmp/foo.txt';
$newName = sha1_file($filename); // 40 characters
//or I can do
$newName = uniqid($filename); // $filename as prefix, followed by 13 characters
Which is a more robust method for new name that is not likely to fail ??
Thanks.
A better solution is to use tmpfile() or tempnam(). Either one is guaranteed to create an unused file that won't collide and can't be "intercepted" by rogue processes changing permissions on you. tmpfile() automatically deletes the file when it's closed, whereas tempnam() keeps it around.
http://www.php.net/manual/en/function.tmpfile.php
http://www.php.net/manual/en/function.tempnam.php
Neither should give names which collide. sha1_file is a lot more compute intensive, but it has the useful property that if two users upload exactly the same file, it will be given the same name and you store it only once. If you don't expect a lot of people to upload the same file, or don't care about storing it twice, uniqid will run a lot faster.
In either case you want to check whether the file already exists.
If you want to be 100% safe and the files are not too big then just use
sha1_file($filename);
This computes the SHA-1 over the whole file, so even if a file with that name already exists, its contents are the same.
Peace
SHA-1, as with any hash function, has the advantage of being deterministic. In your case this might be unwanted.
Using just a hash-function for this will result in collisions on equal files (which can occur in real life).
Having uniqueid is better in this case.
Although the smaller range of 13 characters would suggest a much higher probability of collisions, this is not the case, because it is rare that two files are uploaded at the very same moment. Even then, using the filename as a prefix (and thereby increasing the length of your $newName) will save you from collisions in most cases.
If you want to make sure, you might want to add some loop checking for existing file, and rebuilding the name, until you have no collision (or some break-condition is given).
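Such a loop might look roughly like this. uniqueName() and $maxTries are hypothetical names, and the second argument to uniqid() enables its extra-entropy mode.

```php
<?php
/**
 * Pick a file name in $dir that does not exist yet, retrying a few times.
 * uniqueName() and $maxTries are hypothetical names for this sketch.
 */
function uniqueName(string $dir, string $prefix, int $maxTries = 10): string
{
    for ($i = 0; $i < $maxTries; $i++) {
        $candidate = $dir . '/' . uniqid($prefix, true);
        if (!file_exists($candidate)) {
            return $candidate;
        }
    }
    // Break condition: give up instead of looping forever.
    throw new RuntimeException("No free name after $maxTries attempts");
}

$one = uniqueName(sys_get_temp_dir(), 'foo_');
$two = uniqueName(sys_get_temp_dir(), 'foo_');
```

Note that checking and then creating the file is not atomic; opening the file with fopen() in 'x' mode, which fails if the file already exists, would close that gap.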
I'm generating a unique filename for uploaded files with the following code
$date = date( 'U' );
$user = $_SERVER['REMOTE_ADDR'];
$filename = md5($date.$user);
The problem is that I want to use this filename again later on in the script, but if the script takes a second to run, I'm going to get a different filename the second time I try to use this variable.
For instance, I'm using an upload/resize/save image upload script. The first operation of the script is to copy and save the resized image, which I use a date function to assign a unique name to. Then the script processes the save and saves the whole upload, and assigns it a name. At the end of the script ($thumb and $full are the variables), I need to insert into a MySQL database the filenames I used when I saved the uploads.
Problem is, sometimes on large images it takes more than a second (or during the process, the seconds change) resulting in a different filename being put into the database than is what the file is actually saved under.
Is this just not a good idea to use this method of naming?
AFAIK it's a great way to name the files, although I would check file_exists() and maybe tack on a random number.
You need to store that filename in a variable and reference it again later, instead of relying on the algorithm each time. This could be stored in the user $_SESSION, a cookie, a GET variable, etc between pageloads.
Hope that helps
Just want to add that PHP has a function to create identifiers: uniqid. You can also prefix the identifier with a string (date maybe?).
Always validate your user's input, and the server headers!
I would recommend storing the file name in the session (as per AI). If you store it in one of the other variables, it is more likely for the end user to be able to attack the system through it. MD5 of user concatenated with rand() would be a nice way to get a long list of unique values. Just using rand() would probably have a higher percentage of conflicts.
I am not sure about the process that you are following for uploading files, but another way to handle file uploads is with PHP's built in handlers. You can upload the file and then use the "secure" methods for pulling uploaded files out of the temporary space. (the temporary space in this instance can be safely located outside of the open base dir directive to prevent tampering). is_uploaded_file() and move_uploaded_file() from: http://php.net/manual/en/features.file-upload.post-method.php example 2 might handle the problem you are encountering.
Definitely check for an existing file in that location if you are choosing a filename on the fly. If user input is allowed in any way shape or form, validate and filter the argument to make sure it is safe. Also, if the storage folder is web accessible, make sure you munge the name and probably the extension as well. You do not want someone to be able to upload code and then be able to execute it. That officially leads to BAD activities.
I just discovered that PHP has a built-in function for this, called tempnam. It even avoids race conditions. See http://php.net/manual/en/function.tempnam.php.
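A quick sketch of how that could be used, assuming the system temp directory as the target and a made-up 'img' prefix:

```php
<?php
// tempnam() creates an empty file with a unique name and returns its path.
$path = tempnam(sys_get_temp_dir(), 'img');
if ($path === false) {
    throw new RuntimeException('Could not create a temporary file');
}

file_put_contents($path, 'resized image bytes'); // placeholder contents
$exists = file_exists($path);

unlink($path); // clean up; a real script would keep the file
```

Because the file is created as part of picking the name, a second request cannot be handed the same name in the meantime.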
Why not use
$filename = md5(rand());
This will be pretty much unique in every case. And if you find that $filename already exists you can just call it again.
It's not a good idea to use an ID that depends on time: if you upload two images at the same time, the later one can overwrite the earlier. You should look at a function such as uniqid(). However, if this upload/resize/save script is meant to be "single-user", then this is not such a big problem.
As to the problem itself: if I were you, I would just save the computed filename to a variable and use that variable from that point on. Recomputing something already computed is a waste of time. And when uploading some really big images, or several images at once, the script can take even 20 seconds. You cannot depend on getting everything done within one second.
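In code, that could look something like the sketch below. The thumb_/full_ naming scheme and the .jpg extension are assumptions, and the fallback address only exists so the snippet also runs from the command line.

```php
<?php
// Compute the name once; every later step reuses the same variable.
$user = $_SERVER['REMOTE_ADDR'] ?? '127.0.0.1'; // fallback for CLI runs (assumption)
$filename = md5(uniqid($user, true));

$thumb = 'thumb_' . $filename . '.jpg'; // hypothetical naming scheme
$full  = 'full_' . $filename . '.jpg';

// ... resizing and saving happen here, however long they take ...

// The values inserted into the database are the same ones used when saving.
$row = ['thumb' => $thumb, 'full' => $full];
```

Since $filename is never recomputed, it no longer matters how many seconds pass between saving the files and writing the database row.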