Basically, I'm keeping track of file modifications, in something like:
array (
'crc-of-file' => 'latest-file-contents'
)
This is because I'm working with the contents of several different files at the same time at runtime.
So, the question is, what hashing algorithm should I use over the file contents (as a string, since the file is being loaded anyway)?
Collision prevention is crucial, as well as performance. I don't see any security implications in this so far.
Edit: Another thing I could have used instead of hashing the contents is the file modification timestamp, but I wasn't sure how reliable it is. On the other hand, I think monitoring that timestamp is faster than hashing the file each time.
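A minimal sketch of that timestamp approach (the path and the $mtimes cache array are assumptions) might look like this:
// Cache filemtime() per path and only reload/rehash when it changes.
clearstatcache();                              // PHP caches stat() results
$path     = '/path/to/watched/file';           // hypothetical path
$lastSeen = $mtimes[$path] ?? 0;
$current  = filemtime($path);

if ($current !== $lastSeen) {
    $mtimes[$path] = $current;
    $contents      = file_get_contents($path); // reload only when it changed
    // ... re-process the changed file here ...
}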
CRC is not a hashing algorithm but a checksum algorithm, so your chances of collision will be quite high.
md5 is quite fast and the collision risk is rather minimal for your kind of application / volume. If you are buffering the file, you may also want to look at incremental hashes using the hash extension.
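For example, a rough sketch of incremental hashing with the hash extension (the file path and chunk size are assumptions) could look like this:
$ctx = hash_init('md5');                       // or 'sha1', 'crc32b', ...
$fp  = fopen('/path/to/file', 'rb');           // hypothetical path
while (!feof($fp)) {
    $chunk = fread($fp, 8192);
    hash_update($ctx, $chunk);                 // hash each chunk as it is read
    // ... do whatever other work you need with $chunk here ...
}
fclose($fp);
$checksum = hash_final($ctx);                  // hex digest of the whole file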
A bit more complex, but also worth looking at (if you have it) is the Inotify extension.
I have a web server where users upload their files. I want to implement logic that tells a user when they try to upload the same file twice.
My first idea is to save the md5_file() value to the database and then check whether any file with the same md5 value already exists. File sizes range from 2 megabytes up to 300.
I've heard that md5 has collisions. Is it OK to use it?
Is this approach efficient with 300-megabyte files?
Yes, this is exactly what hashing is for. Consider using sha1; it's an all-around superior hashing algorithm.
No, you probably shouldn't worry about collisions. The odds of people accidentally causing collisions are extremely low, close enough to impossible that you shouldn't waste any time thinking about it up front. If you are seriously worried about it, use the hash as a first check, then compare the file sizes, then compare the files bit by bit.
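A rough sketch of that layered check (the stored hash and path are assumed to come from your database):
function is_same_upload($uploadPath, $storedMd5, $storedPath)
{
    // 1. Hash check: different hashes means definitely not a duplicate.
    if (md5_file($uploadPath) !== $storedMd5) return false;

    // 2. Size check: a same-hash, different-size pair would be a collision.
    if (filesize($uploadPath) !== filesize($storedPath)) return false;

    // 3. Paranoia pass: compare the actual bytes in chunks.
    $a = fopen($uploadPath, 'rb');
    $b = fopen($storedPath, 'rb');
    $same = true;
    while (!feof($a)) {
        if (fread($a, 8192) !== fread($b, 8192)) { $same = false; break; }
    }
    fclose($a);
    fclose($b);
    return $same;
}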
MD5 collisions are rare enough that in this case it shouldn't be an issue.
If you are dealing with large files, however, you'll have to remember that you are essentially uploading the file anyway before you even check whether it is a duplicate.
Upload -> MD5 -> Compare -> Keep or Disregard.
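A rough sketch of that flow (the PDO connection, table, and storage path are assumptions):
// The file is fully uploaded to the temp dir before we can hash it.
$tmp  = $_FILES['upload']['tmp_name'];
$hash = md5_file($tmp);

$stmt = $pdo->prepare('SELECT id FROM files WHERE md5 = ?');   // hypothetical schema
$stmt->execute([$hash]);

if ($stmt->fetch()) {
    unlink($tmp);                                  // disregard the duplicate
} else {
    move_uploaded_file($tmp, '/storage/' . $hash); // keep it
}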
If checking for duplicates, you can usually get away with using sha1.
Or to bulletproof it:
$hash = hash_file("sha512", $filename); // 128 char hex output
(And yes, with very large files md5 does indeed have a fairly high number of collisions)
This question may sound silly, but while browsing programming language sites I usually see an MD5 value written in brackets or beside the language's download link.
Why is the MD5 provided?
What is its use there? Does it help with the download process? Which value's MD5 is given there? Is it the MD5 of the release itself?
Such as:
PHP //MD5 is provided
Ruby //MD5 is given
Python //again MD5
Why is it so ?
If one of the download mirrors is hacked to inject code into the binaries, the md5 of those binaries will change. So by checking the md5 of the downloaded file you can be sure that your file has not been modified.
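A small sketch of that check in PHP (the published value and the file name are placeholders, not real release data):
$publishedMd5 = 'value-listed-next-to-the-download-link'; // placeholder
$actualMd5    = md5_file('downloaded-release.tar.gz');    // hypothetical file name

echo hash_equals($publishedMd5, $actualMd5)
    ? "Checksum matches: the download is intact.\n"
    : "Checksum mismatch: the file was corrupted or tampered with.\n";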
md5 is not, strictly speaking, encryption. It is a message-digest ("hash") algorithm and is not reversible. Think of it as a fingerprint for the file. Other algorithms are available: SHA-1 and SHA-256, to name a few popular options.
Obviously, all hash algorithms are vulnerable to collisions (there is a finite number of hash values but an infinite number of input documents). The chances of a collision happening by accident are small, but it was demonstrated roughly ten years ago that one can forge a fake document with the same md5 hash as the original. In fact, for md5 it is even worse: an attacker can choose the prefix of the "fake" document and, by appending some carefully crafted garbage at the end, produce a fake archive with the same md5 hash as the original.
As of today, the main use of md5 should be to check the integrity of a document against download errors. Not much more than that.
It shouldn't be used to protect against malicious tampering.
What is the absolute fastest way to hash a string in PHP?
I have read that md5 can be relatively slow but am unsure of the alternatives.
Basically, I have a function that I need to squeeze every last bit of performance out of, and within that function I have a string, say "yada yada yada", that I need hashed in some way so it becomes one string.
I should note that security is no issue here - I simply need a single unique string representation, as it's for a cache key.
The whole point of a hash is that it's -not- fast. The faster the hash is the faster it can be cracked.
By that logic, the less secure the hash is - the faster it'll be. If you're going to favour such logic I suggest you either stop what you're doing or use encryption instead.
In response to your update
It sounds like you may want a CRC. Again, it's worth mentioning that typically the faster the check is, the fewer combinations exist for the particular algorithm, and thus the less likely it is to be a "unique representation".
The associated PHP documentation can be found here: hash function with crc32/crc32b
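A quick sketch of that, assuming CRC-level collision odds are acceptable for a cache key:
$data = 'yada yada yada';

$key1 = hash('crc32b', $data);        // 8 hex characters
$key2 = sprintf('%u', crc32($data));  // the same checksum as an unsigned integer string

// If collisions ever become a concern, swapping in a longer hash is trivial:
$key3 = md5($data);                   // 32 hex characters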
Benchmarks. I seem to recall reading somewhere that this depends a lot on your version of Apache and PHP, can't remember where though. I'll post if I remember :)
I am dealing with a concept for a project that involves absolutely critical data.
The most important part is that it needs to be stored encrypted.
An encrypted file system that is mounted from which the webserver serves the files is not enough.
The key to decrypt the data should be passed in the request URI on a secured connection along with a hash and a timestamp.
The hash, based on timestamp, key and filename validates the URI and stores it on a list, so it can only be accessed once.
The important part now is that the webserver should take the file from disk and serve it decrypted using the key it got from the request URI.
It should also be efficient and fast. This requires an encryption method that does not need the whole file to be scanned, so that the file can be decrypted progressively. I think AES can do this with specified block sizes that are encrypted atomically.
So one option would be reading the source file in a PHP script in chunks of a few megabytes, decrypting each chunk with AES and printing the decrypted content. The script then forgets the previous data and continues with the next chunk until EOF.
If AES doesn't support that, I could just encrypt fixed-size chunks of the file separately, concatenate them, and do the same when serving the files. However, I would like to stick to a standard I don't have to reinvent, so I can also use standard libraries to encrypt the files.
However, this will be very inefficient.
Do you know of any apache/lighttpd/nginx module or some better method?
You should open the file with mmap() and then decrypt the data on-the-fly as needed.
I don't see anything more appropriate for this than G-Wan (200 KB), which offers native C scripts and AES encryption (no external libraries needed even if C scripts can link with any existing library).
If you need to achieve the best possible performances, then this is the way to go.
You may want to look into PHP's Stream Filters ( http://php.net/stream.filters ); with a bit of glue code, you could make it read an encrypted file with the regular PHP file access functions, and it would be mostly transparent to existing code.
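A bare-bones sketch of what that glue code might look like (the filter name and the per-chunk decrypt helper are made up; real chunk alignment and IV handling are left out):
class hypothetical_decrypt_filter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            // Replace this with real per-chunk AES decryption.
            $bucket->data = my_decrypt_chunk($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

stream_filter_register('hypothetical.decrypt', 'hypothetical_decrypt_filter');

$fp = fopen('/path/to/encrypted.bin', 'rb');   // hypothetical path
stream_filter_append($fp, 'hypothetical.decrypt', STREAM_FILTER_READ);
fpassthru($fp);                                // streams decrypted bytes to the client
fclose($fp);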
If you can't find a PHP module that lets you decrypt the files chunk/block-wise, you can always pre-split the file into appropriately sized blocks and encrypt each separately.
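A minimal sketch of that pre-split idea (the chunk size, key handling, and the ad-hoc IV-prefix format are all assumptions, not a standard container):
const CHUNK_SIZE = 1048576; // 1 MiB of plaintext per chunk

function encrypt_chunk($plain, $key)
{
    $iv = random_bytes(16);
    // Prepend the IV so each stored chunk is self-contained.
    return $iv . openssl_encrypt($plain, 'aes-256-cbc', $key, OPENSSL_RAW_DATA, $iv);
}

function decrypt_chunk($blob, $key)
{
    $iv = substr($blob, 0, 16);
    return openssl_decrypt(substr($blob, 16), 'aes-256-cbc', $key, OPENSSL_RAW_DATA, $iv);
}
Because every full plaintext chunk encrypts to the same on-disk size (a 16-byte IV plus the padded ciphertext), the server can seek straight to the chunk a request needs and decrypt only that.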
Of course, remember that even if you're only sending out small pieces of the plaintext at a time, there's still plenty of other places that this vulnerable data can be held - particularly in the web server's output buffers. Consider the extreme case of a large-ish file being downloaded by someone stuck on a 2400 baud modem. You may very well decrypt and queue the entire file before even the first chunk's been downloaded, leaving the entire file in the clear in a buffer somewhere.
There's no off-the-shelf solution to provide what you require. And while you've provided a bit of information about how the data will be retrieved, you've not given many clues as to how the data gets onto the webserver in the first place.
You're jumping through lots of hoops to try to ensure that the data is not compromised - but if you're decrypting it on the server, then there is not only a risk of the data being compromised - but also that the key will be compromised. i.e. there's more theatre than substance in the architecture.
You seem to be flexible in the algorithm used for the encryption - which implies that you have some control over the architecture - so there is some scope to resolve these problems.
The hash, based on timestamp, key and filename validates the URI and stores it on a list, so it can only be accessed once.
How does that ensure it is only accessed once? Certainly it could be used to reduce the window of opportunity for CSRF - but it does not eliminate it.
The script then forgets the previous data and continues with the next chunk until eof.
This fundamentally undermines the objective of encryption - patterns within the data will still be apparent - and this provides a mechanism for leveraging brute-force attacks against the data, even if the block size is relatively large. Have a look at the images here for a simple demonstration.
A far more secure approach would be to use CBC, and do the encryption/decryption on the client.
There are javascript implementations of several encryption algorithms (including AES); this page has a good toolkit. And with HTML5 / localstorage you can build a complete client-side app in HTML/javascript.
As you're starting to discover - just using a clever encryption algorithm does not make your application secure - it sounds like you need to go back and think about how you store and retrieve data before you worry about the method you use for encrypting it.
Suppose you wanted to make a file hosting site where people upload their files and send a link to their friends to retrieve them later, and you want to ensure files aren't duplicated where you store them; is PHP's sha1_file good enough for the task? Is there any reason not to use md5_file instead?
For the frontend, the file will be obscured using the original file name stored in a database, but an additional concern is whether this reveals anything about the original poster. Does a file inherit any meta information with it, like last modified or who posted it, or is that stuff based in the file system?
Also, is using a salt frivolous, since security with regard to rainbow table attacks means nothing here and the hash could later be used as a checksum?
One last thing: scalability? Initially it's only going to be used for small files a couple of megs big, but eventually...
Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.
sha1_file good enough?
Using sha1_file is mostly enough; there is a very small chance of collision, but it will almost never happen. To reduce the chance to almost 0, compare file sizes too:
function is_duplicate_file($file1, $file2)
{
    // Cheap check first: files of different sizes can never be identical.
    if (filesize($file1) !== filesize($file2)) return false;
    // Same size: compare the SHA-1 digests (strict string comparison).
    return sha1_file($file1) === sha1_file($file2);
}
md5 is faster than sha1, but it produces a shorter digest (128 bits vs 160); the chance of a collision when using md5 is still very small, though.
Scalability?
There are several methods to compare files; which one to use depends on what your performance concerns are. I made a small test of the different methods:
1- Direct file compare:
if( file_get_contents($file1) != file_get_contents($file2) )
2- Sha1_file
if( sha1_file($file1) != sha1_file($file2) )
3- md5_file
if( md5_file($file1) != md5_file($file2) )
The results:
2 files of 1.2 MB each were compared 100 times; I got the following results:
--------------------------------------------------------
method               time (s)     peak memory (bytes)
--------------------------------------------------------
file_get_contents    0.5          2,721,576
sha1_file            1.86         142,960
md5_file             1.6          142,848
file_get_contents was the fastest, about 3.7 times faster than sha1_file, but it is not memory efficient.
sha1_file and md5_file are memory efficient; they used about 5% of the memory used by file_get_contents.
md5_file might be a better option because it is a little faster than sha1.
So the conclusion is that it depends on whether you want a faster comparison or lower memory usage.
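Roughly how a comparison like that could be timed (the paths and iteration count are assumptions, not the original test setup):
$file1 = '/tmp/a.bin';
$file2 = '/tmp/b.bin';

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    $same = sha1_file($file1) === sha1_file($file2);
}
printf("sha1_file: %.2f s, peak memory: %s bytes\n",
    microtime(true) - $start,
    number_format(memory_get_peak_usage()));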
As per my comment on #ykaganovich's answer, SHA1 is (surprisingly) slightly faster than MD5.
From your description of the problem, you are not trying to create a secure hash - merely to hide the file in a large namespace - in which case the use of a salt / rainbow tables is irrelevant; the only consideration is the likelihood of a false collision (where 2 different files give the same hash). The probability of this happening with md5 is very, very remote, and even more remote with sha1. However, you do need to think about what happens when 2 independent users upload the same warez to your site. Who owns the file?
In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
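For instance, something as simple as this (the length is an arbitrary choice):
// 16 random bytes -> a 32-character hex name with no relationship to the file's contents.
$publicName = bin2hex(random_bytes(16));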
SHA should do just fine in any "normal" environment. That said, this is what Ben Lynn - the author of "Git Magic" - has to say:
A.1. SHA1 Weaknesses
As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository. Hopefully Git will migrate to a better hash function before further research destroys SHA1.
You can always check SHA256, or others that are even longer. Finding an MD5 collision is easier than finding a SHA1 collision.
Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use salt to prevent plaintext/rainbow attacks in case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use it as a checksum as long as you know the salt.
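A tiny sketch of the salted-checksum idea (the salt name is made up; the same salt has to be used whenever the hash is re-checked):
$salt = 'per-site-secret';          // assumed constant, kept outside the database
$ctx  = hash_init('sha1');
hash_update($ctx, $salt);           // mix the salt in first
hash_update_file($ctx, $path);      // then stream the file through
$checksum = hash_final($ctx);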
With respect to scalability, I'd guess that you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum would add big overhead, especially if you do it on the stream as it's being uploaded.