I have written a script to upload an image to a particular portion of my site.
What kind of check do I need to do to detect if a duplicate entry is trying to be uploaded through the form?
Example:
One user submits firefoxlogo.jpg.
Another user a few days later tries submitting firefox.jpg which is the same image, just renamed.
...the same image...
The same as "the binary data is identical" or "the image looks similar"? In the first case, you can calculate the hash of a file using sha1_file (for SHA1 hashes). You should never rely on the filename to decide whether files are unique: one user could upload "firefox.png" containing the browser's logo and someone else a screenshot of it under the same name. The hash also has a fixed length (40 hex characters for SHA1), which is another advantage over filenames.
Each time a user uploads a file, you could keep a record of its SHA1 hash (using sha1_file) in your database. When you receive a new upload, compute the hash of the file while it's still in temporary storage, then query your database for an entry with the same hash. If none exists, you can go ahead and store the file.
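A minimal sketch of that flow (the PDO handle $db, the uploads table, and the form field name upload are all assumptions, not anything prescribed above):

// Hash the file while it is still in PHP's temporary upload location.
$hash = sha1_file($_FILES['upload']['tmp_name']);

// Look for an earlier upload with the same hash.
$stmt = $db->prepare('SELECT COUNT(*) FROM uploads WHERE hash = ?');
$stmt->execute([$hash]);

if ($stmt->fetchColumn() == 0) {
    // No duplicate: store the file and record its hash.
    move_uploaded_file($_FILES['upload']['tmp_name'], 'uploads/' . $hash);
    $db->prepare('INSERT INTO uploads (hash) VALUES (?)')->execute([$hash]);
}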
See http://php.net/exif too. It's not totally reliable for avoiding duplicates, but it's a faster alternative to sha1_file.
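For instance, a cheap metadata pre-filter might look like this (a sketch only; $file is a placeholder path, and a fingerprint like this can only rule duplicates out quickly, never confirm them):

// Cheap fingerprint: dimensions plus EXIF capture time (if any).
// On a fingerprint match you still need sha1_file() to be sure.
list($width, $height) = getimagesize($file);
$exif = @exif_read_data($file);
$taken = is_array($exif) && isset($exif['DateTimeOriginal']) ? $exif['DateTimeOriginal'] : '';
$fingerprint = $width . 'x' . $height . '@' . $taken;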
Hope this helps.
Related
I'm writing a file upload site, and am interested in saving space. If a user uploads a file, I want to ensure this file has not already been uploaded before (if it has been, I will just point to the existing file in the database).
I was considering using sha1_file() on the file, checking the database to see if the digest exists in a database of digests. Then I remembered the pigeonhole principle, and decided to check the undigested files against each other if there is a sha1 digest match.
This seems inefficient to me. I figure I could just check the first kilobyte of each file against the other in the event of a checksum match.
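To illustrate, a sketch of that first-kilobyte check ($newFile and $existingFile are hypothetical paths; this code is only reached when the SHA1 digests already match):

// Compare the first kilobyte of both files before falling back
// to a full byte-for-byte comparison.
$a = file_get_contents($newFile, false, null, 0, 1024);
$b = file_get_contents($existingFile, false, null, 0, 1024);
if ($a === $b) {
    // First kilobyte matches too; compare the remainder of both files.
}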
I haven't thought too much about the trade-off between processing and storage, and it might be possible that the processing power required to check the files costs more than the storage space I would save.
Are there any shortcomings to this method? Am I wasting my time in even bothering with this?
You could use md5( file_data ) to generate the names of the files, and then it will never be possible to upload the same file under a different name. The only problem is that it is technically possible for two different files to generate the same md5, but it's unlikely, especially if the two files have the same extension, so you can treat this as a non-problem. Under this scheme there is no reason to even check: if two hashes are the same, the new file simply overwrites the stored one. This is how most file storage engines work internally, such as zimg. If you are paranoid about collisions, you could first check whether a file with the computed hash and extension already exists, and if it does, compare the stored file's data against the data of the file you are attempting to store; if the data differs, have it email you an alert.
// Name the file after the MD5 of its contents; identical data
// always maps to the same name, so duplicates overwrite themselves.
$data = file_get_contents('flowers.jpg');
$name = md5($data).'.jpg';
$fh = fopen($name, 'w+');
fwrite($fh, $data);
fclose($fh);
I'm working on an image hosting website, and to prevent an "already exists" error I md5 the images that are being uploaded. The problem is that the URL to that website is already quite long, and the full MD5 hash makes it even longer. Is there any way to make the URL shorter?
It's not necessary to use the md5 string as your image filename. To ensure the uniqueness of the images, you can try the following solution:
md5() every new image a user uploads
Store the md5() value in a database
Next time a user uploads an image, check if the item already exists in your database
If it exists, prevent the user from uploading the image; otherwise, proceed.
Repeat
You can keep an id-to-hash mapping on the image hosting server. You can store this mapping in Redis or MySQL, as both are persistent databases.
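A sketch of that mapping (a MySQL images table with an AUTO_INCREMENT id and a hash column, plus the PDO handle $db, are assumptions; base-36 encoding the id is just one way to keep the public URL short):

$hash = md5_file($_FILES['image']['tmp_name']);

// One row per unique image; the auto-increment id becomes the public name.
$db->prepare('INSERT INTO images (hash) VALUES (?)')->execute([$hash]);
$id = $db->lastInsertId();

// Base-36 keeps the URL short: id 123456 becomes "2n9c".
$short = base_convert($id, 10, 36);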
You can use the name of the image plus the time it was uploaded to make it unique but shorter. Use it like this:
$img_name = $uploaded_name.time().$file_ext;
Hence the name will be shorter but unique.
Just use the Unix timestamp to ensure a new and unique file name every time, and to keep its length shorter as well.
I am using a form to upload files (images) to my server. How can I prevent the same image from being uploaded twice? I cannot simply check whether an image with the same title exists, as the same image can have different titles and different images can have the same title.
Any help is appreciated.
Create a hash like ZombieHunter suggested. Why? Because it is easy and fast to search through a big table of hashes to check whether the image already exists. Unfortunately all these hash methods, like md5 or md5_file, work on existing files, not on remote ones, so you will have to upload the file anyway. What you can do is then decide whether or not to keep the file. If you are fetching the files from an online resource, there may be ways to detect the file size from headers and run a hash without downloading it, but that is a special case.
Also, if you have other business logic attached to those images, with concepts like userHasImages or companyHasImages, you can organize them in namespaces/folders/tags to speed up the search even further.
In database terms, strictly speaking of preventing duplicate entries, use a unique index on the column that contains the hash.
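For example (a sketch with a hypothetical images table; SQLSTATE 23000 is what PDO reports for a unique-constraint violation):

// Hypothetical schema: the UNIQUE index lets the database itself
// reject duplicates, whatever the application code does.
//   CREATE TABLE images (
//       id   INT AUTO_INCREMENT PRIMARY KEY,
//       hash CHAR(40) NOT NULL UNIQUE
//   );
try {
    $db->prepare('INSERT INTO images (hash) VALUES (?)')->execute([$hash]);
} catch (PDOException $e) {
    if ($e->getCode() == 23000) {
        // Unique-constraint violation, i.e. the hash is already stored.
    }
}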
I'm trying to allow users to upload files through a PHP website. Since all the files are saved in a single folder on the server, it's conceivable (though admittedly with low probability) that two distinct users could upload two files that, while different, are named exactly the same. Or perhaps they're exactly the same file.
In both cases, I'd like to use exec("openssl md5 " . $file['upload']['tmp_name']) to determine the MD5 hash of the file immediately after it is uploaded. Then I'll check the database for any identical MD5 hash and, if found, I simply won't complete the upload.
However, in the move_uploaded_file documentation, I found this comment:
Warning: If you save a md5_file hash in a database to keep record of uploaded files, which is useful to prevent users from uploading the same file twice, be aware that after using move_uploaded_file the md5_file hash changes! And you are unable to find the corresponding hash and delete it in the database, when a file is deleted.
Is this really the case? Does the MD5 hash of a file in the tmp directory change after moving it to a permanent location? I don't understand why it would. And regardless, is there another, better way of ensuring the same file is not uploaded to the filesystem multiple times?
If you're convinced by all the reasons given here in the answers and decide not to use md5 at all (I'm still not sure whether you WANT to or MUST use a hash), you can just append something unique to each user plus the time of uploading to each file name. That way you'll end up with more readable file names. Something like: $filename = "$filename-$user_ip_string-$microtime";. Of course, you must have all three variables ready and formatted before that; it goes without saying.
No chance of the same file name, same IP address and same microtime occurring at the same time, right? You could easily get away with microtime only, but the IP will make it even more certain.
Of course, like I said, all this applies only if you decide not to use hashing and go for the simpler solution.
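A sketch of preparing those three parts (the normalizations are my own suggestion, since dots in the IP and the "msec sec" format of microtime() are awkward in file names):

$user_ip_string = str_replace('.', '-', $_SERVER['REMOTE_ADDR']);
$microtime = str_replace(array('.', ' '), '', microtime());
// e.g. "report.pdf-203-0-113-7-073524180001696068109"
$filename = "$filename-$user_ip_string-$microtime";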
Shouldn't you use exec("openssl md5 " . $file['upload']['name']) instead? I'm thinking that the temporary name differs from upload to upload.
It would seem that it indeed is the case; I have been looking through the docs as well. But why don't you compute the md5 checksum before using move_uploaded_file and store that value in your database, linking it directly with the new file? That way you can always check an uploaded file against the files that already exist in your filesystem.
This does require a database, but most have access to one.
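In code, that amounts to something like this (a sketch; the PDO handle $db, the uploads table, and the destination folder are assumptions):

// Hash the file BEFORE moving it out of temporary storage...
$hash = md5_file($_FILES['upload']['tmp_name']);
$dest = 'uploads/' . basename($_FILES['upload']['name']);

if (move_uploaded_file($_FILES['upload']['tmp_name'], $dest)) {
    // ...then store the hash together with the final path, so the
    // row can be found and deleted whenever the file is removed.
    $db->prepare('INSERT INTO uploads (hash, path) VALUES (?, ?)')
       ->execute([$hash, $dest]);
}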
Try renaming the uploaded file to a unique id.
Use this:
$dest_filename = $filename;
if (RENAME_FILE) {  // RENAME_FILE: your own on/off constant
    // md5(uniqid(rand(), true)) yields a practically unique 32-char name;
    // $file_ext is assumed to hold the original extension.
    $dest_filename = md5(uniqid(rand(), true)) . '.' . $file_ext;
}
Let me know if it helps :)
No, in general the hash doesn't somehow magically change because of move_uploaded_file.
But if you compute the md5() of a string that includes the file's path, the hash will certainly change when the file is moved to a new path/folder.
If you md5() only the filename, nothing changes.
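You can check this for yourself with a quick sketch (the field name upload and the destination path are placeholders):

$before = md5_file($_FILES['upload']['tmp_name']);
move_uploaded_file($_FILES['upload']['tmp_name'], 'uploads/photo.jpg');
$after = md5_file('uploads/photo.jpg');

var_dump($before === $after); // bool(true): the file's contents are unchanged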
It's a good idea to rename uploaded files with a unique name.
But don't forget to put the folder where you finally store the files outside the document root of your vHost. Located there, the files can't be downloaded without going through a PHP script.
Final remark: while it's very, very unlikely, the md5 hashes of two different files may be identical.
I would like to create an upload form with PHP. The problem is that it will be used to upload a fixed-row-length text file containing orders (the full order details are duplicated on each row).
It should then place the file somewhere and call a program that reads the file and places the orders. The problem is that I want to prevent the same order file from being sent to the order program twice.
The file won't have any unique identifier. I am wondering what the best way is to check that the file isn't the same as a previous one. One solution is to calculate the MD5 of each file and store it, but I am not sure about concurrency, whether this would work, and how many files to compare against.
The only solution I can figure out is to store, say, the last 20 hashes in a file and use flock() on that file to avoid concurrency problems. Otherwise program A checks whether the file exists via MD5 while program B checks via MD5 at the same time, and they may both read a not-yet-updated list; that's why I think I should use an exclusive lock...
Any other solution ?
Store the MD5 hash (or SHA1) and size of the file in the database. Index the hash.
To check for duplicates, just search in the database for a file with the same hash and size.
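A sketch of that lookup (the PDO handle $db and a files table with indexed hash and size columns are assumptions):

$hash = md5_file($path);
$size = filesize($path);

$stmt = $db->prepare('SELECT id FROM files WHERE hash = ? AND size = ?');
$stmt->execute([$hash, $size]);

if ($stmt->fetch() === false) {
    // No duplicate on record: safe to pass the file to the order program.
}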