Best way to detect same files in PHP

I have a web server where users upload their files. I want to implement logic that will tell a user if he tries to upload the same file twice.
My first idea is to save the md5_file() value to the db and then check whether any files with the same md5 value already exist. File sizes range from 2 megabytes up to 300.
I've heard that md5 has collisions. Is it OK to use it?
Is this approach still efficient with 300 megabyte files?

Yes, this is exactly what hashing is for. Consider using sha1; it's an all-around superior hashing algorithm.
No, you probably shouldn't worry about collisions. The odds of people accidentally causing a collision are extremely low, close enough to impossible that you shouldn't waste any time thinking about it up front. If you are seriously worried about it, use the hash as a first check, then compare the file sizes, then compare the files byte by byte.
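A minimal sketch of that layered check, assuming both candidate files are already on local disk (the helper name is made up for illustration):

// Hash first, then size, then a final byte-by-byte comparison.
// (In practice the size check is the cheapest and could just as well come first.)
function files_are_identical(string $a, string $b): bool
{
    if (sha1_file($a) !== sha1_file($b)) {
        return false;                        // different digests => different files
    }
    if (filesize($a) !== filesize($b)) {
        return false;                        // same digest, different size: a freak collision
    }
    // Paranoid final pass: compare raw bytes in fixed-size chunks to keep memory flat.
    $fa = fopen($a, 'rb');
    $fb = fopen($b, 'rb');
    $same = true;
    while (!feof($fa)) {
        if (fread($fa, 8192) !== fread($fb, 8192)) {
            $same = false;
            break;
        }
    }
    fclose($fa);
    fclose($fb);
    return $same;
}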

MD5 collisions are rare enough that in this case they shouldn't be an issue.
If you are dealing with large files, however, remember that you are essentially uploading the whole file anyway before you can even check whether it is a duplicate.
Upload -> MD5 -> Compare -> Keep or Disregard.
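A sketch of that flow, assuming the hashes live in a hypothetical files table and $pdo is an existing PDO connection:

// The file is fully uploaded into PHP's temp dir before we can hash it.
$tmpPath = $_FILES['upload']['tmp_name'];
$digest  = md5_file($tmpPath);                       // or sha1_file()/hash_file('sha256', ...)

// Assumed schema: files(id, digest, path).
$stmt = $pdo->prepare('SELECT id FROM files WHERE digest = ?');
$stmt->execute([$digest]);

if ($stmt->fetch()) {
    // Duplicate: disregard the upload (or just point the user at the existing record).
    unlink($tmpPath);
} else {
    // New file: keep it and remember its hash.
    $target = '/storage/' . $digest;
    move_uploaded_file($tmpPath, $target);
    $pdo->prepare('INSERT INTO files (digest, path) VALUES (?, ?)')
        ->execute([$digest, $target]);
}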

If checking for duplicates, you can usually get away with using sha1.
Or, to bulletproof it:
$hash = hash_file("sha512", $filename); // 128-character hex output
(And while a file's size doesn't change md5's collision odds, md5's collision resistance is broken, which is one more reason to prefer the sha family.)

Related

HASH for GIF IMAGES

I need to know if there is any way to get a unique hash from GIF images. I tried with the SHA1 file function
sha1_file
but I don't know whether two different GIF images can ever end up with the same hash.
Can that happen with SHA1? Would SHA2 or MD5 be better in that case, or any other algorithm already implemented in PHP?
I know it also depends on file size, but the GIF images don't exceed 10 MB in any case.
I need recommendations for this problem. Best regards.
There is no hash function that creates different values for each and every set of images you provide. This should be obvious: your hash values are much shorter than the files themselves, so they are bound to drop some information along the way. Given a fixed set of images it is rather simple to produce a perfect hash function (e.g. by numbering them), but this is probably not the answer you are looking for.
On the other hand you could use "perfect hashing", a two-step hashing scheme that guarantees amortized O(1) access, but as you are asking for a unique hash that may also not be what you are looking for. Could you be a bit more specific about why you insist on the hash value being unique, and under what circumstances?
sha1_file is fine.
In theory you can run into two files that hash to the same value, but in practice it is so stupendously unlikely that you should not worry about it.
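A minimal sketch of indexing the images by their sha1 digest (the directory path is just an example):

// Identical files map to the same digest, so duplicates collapse onto one key.
$index = [];
foreach (glob('/var/gifs/*.gif') as $path) {
    $digest = sha1_file($path);              // 40-character hex string
    if (isset($index[$digest])) {
        echo "$path duplicates {$index[$digest]}\n";
        continue;
    }
    $index[$digest] = $path;
}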
Hash functions don't provide any guarantees about uniqueness. Patru explains why, very well - this is the pigeonhole principle, if you'd like to read up.
I'd like to talk about another aspect, though. While you won't get any theoretical guarantees, you get a practical guarantee. Consider this: SHA-256 generates hashes that are 256 bits long, which means there are 2^256 possible hashes it can generate. Assume further that the hashes it generates are distributed almost purely randomly (true for SHA-256). That means that if you generate a billion hashes a second, 24 hours a day, you'll have generated 31,536,000,000,000,000 hashes a year. A lot, right?
Now divide 2^256 by that. That's ~10^60. If you walked linearly through all possible hashes, that's how many years it would take you to generate them all (pack a lunch). Divide that by two, and that's... still ~10^60. That's how many years you'd have to work to have a greater than 50% chance of generating the same hash twice.
To put it another way, if you generate a billion hashes a second for a century, you'd have about a 1/10^58 chance of generating the same hash twice. Until the sun burns out, about 1/10^50.
Those are damn fine chances.
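A quick back-of-the-envelope check of those figures, assuming the bcmath extension is available:

// 2^256 possible SHA-256 outputs.
$space = bcpow('2', '256');

// Hashes produced per year at one billion per second.
$perYear = bcmul('1000000000', (string)(60 * 60 * 24 * 365));   // 31,536,000,000,000,000

// Years needed to walk the whole output space at that rate.
echo bcdiv($space, $perYear);   // a 61-digit number, i.e. on the order of 10^60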

Avoiding hash collision in php when using sha1 for hashing

Suppose a hash collision occurs while I am using the sha1 function in PHP.
Will this code avoid it permanently, or do I have to use some other way?
$filename=sha1($filename.'|'.microtime());
OR
$filename=sha1($filename.'|'.rand());
If this code doesn't provide protection from hash collisions,
what should I do to avoid any kind of hash collision, assuming there can be more than 100,000 entries in the db?
It's very unlikely that a hash collision will happen with sha1.
The probability of a sha1 collision is negligible,
and the risk is not a practical one (no publicly known sha1 collision existed when this was written), so you are safe to use it.
Using a salt like microtime or a random number may decrease the probability slightly, but you simply can't avoid collisions altogether.
What you are computing is still sha1(string), whether that string is a mixed value or a single value, so adding microtime() or rand() won't meaningfully change the collision probability.
If anything, collisions of sha1(mixedvalue) could be just as likely as, or more likely than, collisions of sha1(filename), so that certainly is of no use.
So don't worry; use this, or the simpler way if you prefer. It won't create problems in the future. Thinking about hash collisions is a waste of time when the chances are this small.
Just to be clear: you can't completely avoid hash collisions. There is an infinite number of inputs mapping to a finite number of outputs. But you can take into account things like the file's size, the current system time and other data as a salt, which will increase the entropy of your message digests.
Just sha1() the entire file path, not only the file name.
A filename like xy.png can only appear once in a given directory, so your hash will be unique for that path.
This also has the advantage that you will not end up with duplicate files (whereas with rand()/microtime() you can get the same file 10 times in the same dir, and if it's a 1 GB file that can cause problems).
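A sketch combining those ideas: hash the full path, optionally mixed with the file's size and mtime (the path here is hypothetical, and this reduces but does not eliminate the chance of a collision):

$path   = '/var/www/uploads/user42/xy.png';    // hypothetical full path
$digest = sha1(
    $path . '|' .
    filesize($path) . '|' .                    // extra salt: file length
    filemtime($path)                           //             last-modified time
);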
Neither of these avoids hash collisions.
Hash collisions happen because the algorithm generates a hash of a fixed size, regardless of the size of the input value.
A hash collision is when two different values, like "mypassword" and "dsjakfuiUIs2kh-1jlks", end up generating the same hash because of the mathematical operations performed on them.
You can't write code to prevent hash collisions; how often they happen depends on the hashing algorithm you are using.

Amazon S3 creating unique keys for every object

My app users upload their files to one bucket. How can I ensure that each object in my S3 bucket has a unique key to prevent objects from being overwritten?
At the moment I'm encrypting filenames with a random string in my php script before sending the file to S3.
For the sake of the discussion let's suppose that the uploader finds a way to manipulate the filename on upload. He wants to replace all the images on my site with a picture of a banana. What is a good way to prevent overwriting files in S3 if encryption fails?
Edit: I don't think versioning will work because I can't specify a version id in an image URL when displaying images from my bucket.
Are you encrypting, or hashing? If you are using md5 or sha1 hashes, an attacker could easily find a hash collision and make you slip on a banana skin. If you are encrypting without a random initialization vector, an attacker might be able to deduce your key after uploading a few hundred files, and encryption is probably not the best approach. It is computationally expensive, difficult to implement, and you can get a safer mechanism for this job with less effort.
If you prepend a random string to each filename, using a reasonably reliable source of entropy, you shouldn't have any issues, but you should check whether the file already exists anyway. Coding a loop that checks with S3::GetObject and generates a new random string might seem like a lot of effort for something that will almost never need to run, but "almost never" means it has a high probability of happening eventually.
Checking for a file with that name before uploading it would work.
If the file already exists, re-randomize the file name, and try again.
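A sketch of that retry loop, assuming the AWS SDK for PHP v3 with its S3Client::doesObjectExist() helper; $s3, $bucket, $originalName and $localPath are assumed to be set up elsewhere:

// Keep generating random keys until one is free.
do {
    $key = bin2hex(random_bytes(16)) . '-' . basename($originalName);
} while ($s3->doesObjectExist($bucket, $key));

// Upload under the unique key.
// (There is still a tiny race window between the check and the upload.)
$s3->putObject([
    'Bucket'     => $bucket,
    'Key'        => $key,
    'SourceFile' => $localPath,
]);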

Best hash algorithm for a data index (ie, crc)

Basically, I'm keeping track of file modifications, in something like:
array(
    'crc-of-file' => 'latest-file-contents'
)
This is because I'm working on the file contents of different files at runtime at the same time.
So, the question is, what hashing algorithm should I use over the file contents (as a string, since the file is being loaded anyway)?
Collision prevention is crucial, as well as performance. I don't see any security implications in this so far.
Edit: Another thing I could have used instead of hashing the contents is the file modification timestamp, but I wasn't sure how reliable it is. On the other hand, I think it's faster to monitor that timestamp than to hash the file each time.
CRC is not a hashing algorithm but a checksum algorithm, so your chances of collision will be quite high.
md5 is quite fast and the collision risk is rather minimal for your kind of application / volume. If you are buffering the file, you may also want to look at incremental hashes using the hash extension.
A bit more complex, but also worth looking at (if you have it) is the Inotify extension.
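A sketch of the incremental hashing mentioned above, using the hash extension (the path and chunk size are arbitrary):

// Hash a file chunk by chunk instead of loading it all into memory.
$ctx = hash_init('md5');                      // or 'sha1', 'crc32b', ...
$fp  = fopen('/path/to/watched/file', 'rb');
while (!feof($fp)) {
    hash_update($ctx, fread($fp, 8192));      // feed the context incrementally
}
fclose($fp);
$digest = hash_final($ctx);                   // hex digest of the whole file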

Is SHA sufficient for checking file duplication? (sha1_file in PHP)

Suppose you wanted to make a file hosting site where people upload their files and send a link to their friends to retrieve them later, and you want to ensure files are not duplicated where you store them. Is PHP's sha1_file good enough for the task? Is there any reason not to use md5_file instead?
For the frontend, files will be obscured using the original file name stored in a database, but a further concern is whether this would reveal anything about the original poster. Does a file carry any meta information with it, like last modified or who posted it, or is that stuff kept in the file system?
Also, is using a salt pointless, since security against rainbow-table attacks means nothing here, and the hash could later be used as a checksum?
One last thing: scalability? Initially it's only going to be used for small files a couple of megs big, but eventually...
Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.
sha1_file good enough?
Using sha1_file is mostly enough; there is a very small chance of collision, but it will almost never happen. To reduce the chance to almost 0, compare file sizes too:
function is_duplicate_file($file1, $file2)
{
    // Cheap check first: different sizes can never be the same file.
    if (filesize($file1) !== filesize($file2)) return false;

    // Same size: compare the content hashes (strict comparison avoids
    // PHP's loose "0e..." numeric-string pitfall).
    if (sha1_file($file1) === sha1_file($file2)) return true;

    return false;
}
md5 is faster than sha1 but it produces a shorter digest; the chance of a collision when using md5 is still very small, though.
Scalability?
There are several methods to compare files; which one to use depends on what your performance concerns are. I made a small test of the different methods:
1- Direct file compare:
if( file_get_contents($file1) != file_get_contents($file2) )
2- Sha1_file
if( sha1_file($file1) != sha1_file($file2) )
3- md5_file
if( md5_file($file1) != md5_file($file2) )
The results:
Two files of 1.2 MB each were compared 100 times; I got the following results:
--------------------------------------------------------
method                time (s)     peak memory (bytes)
--------------------------------------------------------
file_get_contents     0.5          2,721,576
sha1_file             1.86         142,960
md5_file              1.6          142,848
file_get_contents was the fastest, about 3.7× faster than sha1_file, but it is not memory efficient.
Sha1_file and md5_file are memory efficient, they used about 5% of the memory used by file_get_contents.
md5_file might be a better option because it is a little faster than sha1.
So the conclusion is that it depends on whether you want a faster comparison or lower memory usage.
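A rough sketch of how such a timing could be reproduced (the paths are placeholders; the exact numbers will of course vary by machine):

$file1 = '/tmp/a.bin';
$file2 = '/tmp/b.bin';

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    sha1_file($file1) === sha1_file($file2);   // swap in md5_file() or file_get_contents()
}
printf("time: %.2fs, peak memory: %d bytes\n",
    microtime(true) - $start,
    memory_get_peak_usage());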
As per my comment on #ykaganovich's answer, SHA1 is (surprisingly) slightly faster than MD5.
From your description of the problem, you are not trying to create a secure hash - merely to hide the file in a large namespace - in which case a salt and rainbow tables are irrelevant; the only consideration is the likelihood of a false collision (where 2 different files give the same hash). The probability of this happening with md5 is very, very remote. It's even more remote with sha1. However, you do need to think about what happens when 2 independent users upload the same warez to your site. Who owns the file?
In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
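For instance (random_bytes() needs PHP 7+, and the length is arbitrary):

$key = bin2hex(random_bytes(20));   // 40 hex chars, the same length as a sha1 digest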
SHA should do just fine in any "normal" environment, although this is what Ben Lynn, the author of "Git Magic", has to say:
A.1. SHA1 Weaknesses
As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository.
Hopefully Git will migrate to a better hash function before further research destroys SHA1.
You can always check SHA256, or others that are even longer. Finding an MD5 collision is easier than finding a SHA1 collision.
Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use salt to prevent plaintext/rainbow attacks in case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use it as a checksum as long as you know the salt.
With respect to scalability, I'd guess that you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum adds much overhead, especially if you do it on the stream as it's being uploaded.
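A sketch of hashing the stream as it arrives, assuming the raw request body is the file (for a multipart form upload the same idea applies to the temp file instead); the destination path is hypothetical:

// Hash the incoming body without buffering the whole file in memory.
$in  = fopen('php://input', 'rb');
$out = fopen('/storage/incoming.tmp', 'wb');
$ctx = hash_init('sha1');

while (!feof($in)) {
    $chunk = fread($in, 8192);
    hash_update($ctx, $chunk);     // update the digest incrementally
    fwrite($out, $chunk);          // while persisting the data
}

$digest = hash_final($ctx);        // ready to check against stored hashes
fclose($in);
fclose($out);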
