Suppose you want to make a file hosting site where people can upload files and send a link to their friends to retrieve them later, and you want to ensure files are not duplicated in storage. Is PHP's sha1_file good enough for the task? Is there any reason not to use md5_file instead?
On the frontend, the hash will be obscured by serving the original file name stored in a database, but an additional concern is whether the hash would reveal anything about the original poster. Does a file carry any meta information with it, such as last modified time or who posted it, or is that stored by the file system?
Also, is using a salt frivolous, since security against rainbow table attacks means nothing here, and the hash could later be used as a checksum?
One last thing: scalability? Initially it will only be used for small files, a couple of megabytes big, but eventually...
Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.
sha1_file good enough?
Using sha1_file is mostly enough; there is a very small chance of collision, but it will almost never happen. To reduce the chance to almost zero, compare file sizes too:
function is_duplicate_file($file1, $file2)
{
    // Cheap check first: files of different sizes cannot be identical.
    if (filesize($file1) !== filesize($file2)) return false;

    // Same size: compare content hashes.
    return sha1_file($file1) === sha1_file($file2);
}
md5 is faster than sha1 but it generates a shorter digest; the chance of collision when using md5 is still very small, though.
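For example, it could be used like this when a new upload comes in (the paths here are purely illustrative):

if (is_duplicate_file('/uploads/new.tmp', '/uploads/existing.bin')) {
    echo "Duplicate detected - no need to store the file again.";
}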
Scalability?
There are several methods to compare files; which one to use depends on what your performance concerns are. I made a small test of different methods:
1- Direct file compare:
if( file_get_contents($file1) != file_get_contents($file2) )
2- sha1_file
if( sha1_file($file1) != sha1_file($file2) )
3- md5_file
if( md5_file($file1) != md5_file($file2) )
The results:
Two files of 1.2 MB each were compared 100 times; I got the following results:
--------------------------------------------------------
method               time (s)     peak memory (bytes)
--------------------------------------------------------
file_get_contents    0.5          2,721,576
sha1_file            1.86         142,960
md5_file             1.6          142,848
file_get_contents was the fastest, about 3.7 times faster than sha1_file, but it is not memory efficient.
sha1_file and md5_file are memory efficient; they used about 5% of the memory used by file_get_contents.
md5_file might be the better of the two because it is a little faster than sha1_file.
So the conclusion is that it depends on whether you want a faster comparison or lower memory usage.
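For reference, here is a minimal sketch of how such a comparison could be timed; the file paths and iteration count are illustrative, not the exact harness used above:

$file1 = '/tmp/a.bin';   // hypothetical test files, ~1.2 MB each
$file2 = '/tmp/b.bin';

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    $same = (sha1_file($file1) === sha1_file($file2));
}
printf("time: %.2fs, peak memory: %s bytes\n",
    microtime(true) - $start, number_format(memory_get_peak_usage()));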
As per my comment on #ykaganovich's answer, SHA1 is (surprisingly) slightly faster than MD5.
From your description of the problem, you are not trying to create a secure hash, merely to hide the file in a large namespace. In that case the use of a salt / rainbow tables is irrelevant; the only consideration is the likelihood of a false collision (where two different files give the same hash). The probability of this happening with md5 is very, very remote, and it is even more remote with sha1. However, you do need to think about what happens when two independent users upload the same warez to your site. Who owns the file?
In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
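For instance, a sufficiently long random identifier for the public link could be generated like this (requires PHP 7+ for random_bytes; on older versions openssl_random_pseudo_bytes would be the substitute):

// 32 hex characters of cryptographically secure randomness for the download URL
$publicId = bin2hex(random_bytes(16));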
SHA should do just fine in any "normal" environment, although this is what Ben Lynn, the author of "Git Magic", has to say:
A.1. SHA1 Weaknesses
As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository.
Hopefully Git will migrate to a better hash function before further research destroys SHA1.
You can always check SHA256, or others that are even longer. Finding an MD5 collision is easier than finding a SHA1 collision.
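Switching digests in PHP is trivial, since the hash extension exposes them all through one function; a minimal sketch (the variable names are illustrative):

$digest = hash_file('sha256', $uploadedPath);   // 64 hex characters instead of sha1's 40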
Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use a salt to prevent plaintext/rainbow attacks in the case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use it as a checksum as long as you know the salt.
With respect to scalability, I'd guess that you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum would add much overhead, especially if you do it on the stream as it's being uploaded.
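A minimal sketch of hashing the stream while it is written to disk (paths are hypothetical, and a real upload handler would add error checking):

$in  = fopen('php://input', 'rb');          // raw request body (e.g. a PUT upload)
$out = fopen('/storage/incoming.tmp', 'wb');
$ctx = hash_init('sha1');

while (!feof($in)) {
    $chunk = fread($in, 8192);
    hash_update($ctx, $chunk);              // hash the chunk...
    fwrite($out, $chunk);                   // ...while writing it to disk
}

fclose($in);
fclose($out);
$checksum = hash_final($ctx);               // same value sha1_file() would give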
Related
I need to know if there is any way to get a unique hash from GIF images. I tried with the SHA1 file function
sha1_file
but I don't know whether two hashes of different GIF images could ever end up being the same.
Can that happen with SHA1? In that case, is SHA2 or MD5 better, or any other algorithm already implemented in PHP?
I know it also depends on file size, but the GIF images don't exceed 10 MB in any case.
I need recommendations for this problem. Best regards.
There is no hash function that creates different values for each and every set of images you provide. This should be obvious as your hash values are much shorter than the files themselves and therefore they are bound to drop some information on the way. Given a fixed set of images it is rather simple to produce a perfect hash function (e.g. by numbering them), but this is probably not the answer you are looking for.
On the other hand you can use "perfect hashing", a two-step hashing scheme that guarantees amortized O(1) access, but as you are asking for a unique 'hash' that may also not be what you are looking for. Could you be a bit more specific about why you insist on the hash value being unique and under what circumstances?
sha1_file is fine.
In theory you can run into two files that hash to the same value, but in practice it is so stupendously unlikely that you should not worry about it.
Hash functions don't provide any guarantees about uniqueness. Patru explains why, very well - this is the pigeonhole principle, if you'd like to read up.
I'd like to talk about another aspect, though. While you won't get any theoretical guarantees, you get a practical guarantee. Consider this: SHA-256 generates hashes that are 256 bits long. That means there are 2^256 possible hashes it can generate. Assume further that the hashes it generates are distributed almost purely randomly (true for SHA-256). That means that if you generate a billion hashes a second, 24 hours a day, you'll have generated 31,536,000,000,000,000 hashes a year. A lot, right?
Divide 2^256 by that. That's ~10^60. If you walked linearly through all possible hashes, that's how many years it would take you to generate them all (pack a lunch). Divide that by two, and that's... still ~10^60. That's how many years you'd have to work to have a greater than 50% chance of generating the same hash twice.
To put it another way, if you generate a billion hashes a second for a century, you'd have about a 1 in 10^58 chance of generating the same hash twice. Until the sun burns out, about 1 in 10^50.
Those are damn fine chances.
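For anyone who wants to reproduce the rough arithmetic above, a purely illustrative snippet (plain floats are precise enough at this scale):

$hashesPerYear = 1e9 * 60 * 60 * 24 * 365;   // one billion hashes per second, all year
$space         = pow(2, 256);                // size of SHA-256's output space (~1.16e77)
echo $space / $hashesPerYear;                // ~3.7e60 years to walk the whole space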
md5() is usually used for passwords and short strings,
but this time I want to encrypt (without needing to decrypt) a large string, like a whole article ... (not to mention I'd need to do this every few seconds).
Would this be a problem? Or does it take the processor more work / take longer to md5 a big string vs a short password?
I read md5 is really, really fast...
For those curious, I'm trying to generate a "signature" of the string in question.
Oh dear.
Well, MD5 is not encryption.
(Encryption is, by definition, designed to be reversible.)
Do not use MD5 or SHA for password hashes - they are too fast! (And MD5 is plain broken for such a task.)
Hash algorithms (including MD5) take time proportional to the size of the input, or O(n). That means, it will take about "twice as long" to hash 100MB as it does to hash 50MB. For in-memory PHP strings this will be "in the blink of an eye" (as the I/O will most likely be the bottleneck) - you'll need to run performance benchmarks on real data in a real environment to quantify it.
MD5 is indeed "really really fast"; the algorithm is relatively simple and, like many hash algorithms, was designed to be fast. Don't worry about performance until there is a real performance issue - modern CPUs are very fast. Also, while MD5 (and SHA) is fast, running MD5 back-to-back-to-back as in an infinite loop will of course "eat" all the CPU; an idle CPU is a wasted CPU if there is work to be done.
However, consider SHA (preferably SHA-2) for a "general" signature hash - it is only marginally slower (by a constant factor) but it is a better algorithm, even when trimmed to the same output space, and just might prevent issues in the future.
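If you want to quantify that on your own hardware, a quick benchmark along these lines will do (the 1 MB test string and the iteration count are arbitrary choices):

$article = str_repeat('x', 1024 * 1024);          // ~1 MB of in-memory data

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    md5($article);
}
printf("md5:    %.3fs\n", microtime(true) - $start);

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    hash('sha256', $article);
}
printf("sha256: %.3fs\n", microtime(true) - $start);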
I have a web server where users upload their files. I want to implement logic that will tell a user if they try to upload the same file twice.
My first idea is to save the md5_file() value to the db and then check whether any files have the same md5 value. File sizes range from 2 megabytes up to 300.
I heard that md5 has collisions. Is it OK to use it?
Is it efficient to use such logic with 300 megabyte files?
Yes, this is exactly what hashing is for. Consider using sha1; it's an all-around superior hashing algorithm.
No, you probably shouldn't worry about collisions. The odds of people accidentally causing collisions are extremely low, close enough to impossible that you shouldn't waste any time thinking about it up front. If you are seriously worried about it, use the hash as a first check, then compare the file sizes, then compare the files bit by bit.
MD5 collisions are rare enough that in this case it shouldn't be an issue.
If you are dealing with large files, however, you'll have to remember that you are essentially uploading the file anyway before you even check whether it is a duplicate.
Upload -> MD5 -> Compare -> Keep or Disregard.
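A minimal sketch of that flow against a database; the table name, column names, and the $pdo connection are assumptions made for illustration:

$hash = md5_file($_FILES['upload']['tmp_name']);

// $pdo is an already-open PDO connection; "files" and "hash" are illustrative names.
$stmt = $pdo->prepare('SELECT id FROM files WHERE hash = ?');
$stmt->execute(array($hash));

if ($stmt->fetch()) {
    // Duplicate: discard the temp file and point the user at the existing record.
    unlink($_FILES['upload']['tmp_name']);
} else {
    move_uploaded_file($_FILES['upload']['tmp_name'], '/storage/' . $hash);
    $pdo->prepare('INSERT INTO files (hash, name) VALUES (?, ?)')
        ->execute(array($hash, $_FILES['upload']['name']));
}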
If checking for duplicates, you can usually get away with using sha1.
Or to bulletproof it:
$hash = hash_file("sha512", $filename); // 128 char hex output
(And yes, with very large files md5 does indeed have a fairly high number of collisions)
Background:
I want to add a login to my small site, an online PHP application, which I'd like to build so it can handle a lot of user activity in the future.
Before I look further into implementing LightOpenID I want to add a normal login. The book I was learning from is called Head First PHP & MySQL (2008) and the final code of the chapter uses SHA('$user_password') as part of the mysql query.
As I take an interest in Jeff Atwood's writing, I'm well aware of bcrypt as well as scrypt. But seeing as there's no PHP implementation of scrypt and I have no dedicated server to run it on, I decided to at least look into implementing bcrypt for now.
However, I'm not completely naive; I know I should watch out not to overextend my very humble hosting resources. The PHP app itself should always come first where resources are concerned.
Andrew Moore's method seems nice (though I'll have to see how to implement it on PHP 5.2.17, which my host uses), and it comes with a tip for hardware speed:
You should select a number of rounds that results in 200-250 ms of work. Part of the reason why bcrypt is secure is that it is slow. You must ensure to have a number of rounds that keeps that characteristic.
– Andrew Moore
Another user states that for him running microtime() gives 0.314 for Bcrypt(9), which thus would be near optimal.
The question:
Seeing as I only have very humble resources at my disposal and I'd like to make the best of them, leaving most for the PHP app itself, am I still better off using Bcrypt(4) instead of something else?
Bcrypt(4) returns true almost instantly, but does it still keep the characteristic Moore talks about? (Would that be the part concerning RAM that makes it harder for GPU brute-forcing?) Or would SHA512 or something else actually be as fast but more secure at this point?
I'd expect Bcrypt(4) to win in this situation, but the hell do I know right? :p
Security is always about what you are trying to secure.
If you are more concerned about your resources than about your security, bcrypt(2) is already overkill. No hacker will ever try to break that for a normal application when there are easier targets like LinkedIn and many other sites that just use functions from the SHA family, with a single iteration and no salt. Attackers will go for the low-hanging fruit, or they will keep trying to hack your application, just not through the password hashing.
SHA-512 is not much more secure than SHA-1 as a password hashing algorithm [1]; it was not designed for that purpose. They can still be used as primitives for building secure cryptographic constructions, but that's something no single person should do on their own. To be considered secure, crypto algorithms must be public so they can be peer reviewed, and must pass the test of time. And obviously, they must be designed for what you are going to use them for. MD5, SHA-X, etc. are cryptographic algorithms, but they weren't designed for storing passwords.
Just add or remove rounds from your bcrypt. In this case I would use 1 or 2. Also keep in mind that 1 round != 1 iteration: the round parameter is logarithmic, so the iteration count grows exponentially with it. If you read about how bcrypt works, you will see that there is much more to it than just iterations. For example, you mentioned a 'unique salt per password'. Bcrypt already has that.
[1] For other things it's obviously more secure
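To illustrate the point about rounds growing exponentially (this sketch assumes PHP 5.5+ for password_hash; on the asker's PHP 5.2 host, crypt() with a Blowfish salt would be the closest equivalent):

// "cost" is a log2 work factor: cost 10 means 2^10 = 1024 rounds of bcrypt's key setup,
// so each +1 doubles the work. The chosen cost and a random salt are embedded in the hash.
$hash = password_hash('secret', PASSWORD_BCRYPT, array('cost' => 10));
echo $hash;   // e.g. "$2y$10$..." - the "10" is the cost parameter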
You should look at security of the system, not just of bcrypt.
Certainly, if you want to store passwords, bcrypt or PBKDF2 is the way to proceed. Make sure you use a sufficiently large, random salt per user or password. Then try to maximize the number of iterations. If that's small, then it is small, but any extra iteration is better than nothing.
Note that this does little against eavesdropping or man-in-the-middle (MitM) attempts. You should use SSL for that; otherwise the password or the hash (if you do the hashing client side) can be replayed.
Furthermore, if you want to protect against brute force attacks (attackers trying the most common passwords) you should create (or copy) a good password management scheme. Limit the number of incorrect logins and try to get users to create strong passwords. Also limit the amount of information you return to a user about incorrect logins, as that user may be the attacker.
Or would SHA512 or something else actually be as fast but more secure at this point?
Slowness is a major feature of password hashing algorithms (of which, bcrypt is one, but SHA-512 by itself is not) - the slower your algorithm is (relative to other algorithms), the harder it is for an attacker to brute force passwords based on the hashes. From this perspective, a single round of SHA-512 is less suitable than bcrypt for the purpose of securely storing passwords, because it is considerably faster.
In my opinion, the best approach to take is to pick a password hashing algorithm (bcrypt, PBKDF2, scrypt) and then tune the work factor to give you the best tradeoff between speed and security, given the computing resources available to you and the characteristics of your system. A higher work factor = more secure, but also more resource-intensive.
The good news is that users typically use your login function infrequently compared to other functions, so the impact of a slower/resource intensive login function is generally not a big problem.
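A rough way to tune the work factor on your own host, assuming PHP 5.5+ for password_hash and following the 200-250 ms guideline quoted earlier:

for ($cost = 4; $cost <= 14; $cost++) {
    $start = microtime(true);
    password_hash('benchmark-password', PASSWORD_BCRYPT, array('cost' => $cost));
    printf("cost %2d: %.3fs\n", $cost, microtime(true) - $start);
}
// Pick the highest cost that stays within your latency budget (e.g. ~0.25s).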
Basically, I'm keeping track of file modifications, in something like:
array (
'crc-of-file' => 'latest-file-contents'
)
This is because I'm working on the file contents of different files at runtime at the same time.
So, the question is, what hashing algorithm should I use over the file contents (as a string, since the file is being loaded anyway)?
Collision prevention is crucial, as well as performance. I don't see any security implications in this so far.
Edit: Another thing I could use instead of hashing the contents is the file modification timestamp, but I'm not sure how reliable it is. On the other hand, I think it's faster to check that timestamp than to hash the file each time.
CRC is not a hashing algorithm but a checksum algorithm, so your chances of collision will be quite high.
md5 is quite fast and the collision risk is rather minimal for your kind of application / volume. If you are buffering the file, you may also want to look at incremental hashes using the hash extension.
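A minimal sketch of that incremental approach (the path and chunk size are illustrative); it avoids holding the whole file in memory:

$ctx = hash_init('md5');
$fp  = fopen('/path/to/file', 'rb');
while (!feof($fp)) {
    hash_update($ctx, fread($fp, 8192));   // feed the hash 8 KB at a time
}
fclose($fp);
$digest = hash_final($ctx);                // same hex digest md5_file() would return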
A bit more complex, but also worth looking at (if you have it) is the Inotify extension.