Uniquely identify Files and Directories on a Server for Comparison

What would be the best way to compare files and/or directories? Let's say I want to store files on a server or a collection of servers, like a cloud-based system. My users collaborate with one another in many cases, and in some cases not. Either way, I can have upwards of a hundred people with the exact same file. The only key difference is that they likely renamed the file. But it is essentially the same exact data all around. The other thing is that there is no specific file type: there's pdf, doc, docx, txt, videos, audio files, etc., but the issue boils down to the same files over and over. What I want to do is cut it down: remove the hundreds of dupes and, with the help of a database, store things like the file name the user provided, so I can store the single remaining file how and where I want while still presenting the info each user gave.
Now I know I can do something with md5 or sha1 or sha2 or something equivalent that will give me a unique value I can use for such comparisons. But I am not exactly sure how or where to begin. For example, how can I get the sha or md5 of a file with PHP? When I look those up I get methods for strings but not files.
Overall I am here looking to bounce ideas around to figure this out, not so much for a direct solution. Any help would be great.

$filePath = '/var/www/site/public/uploads/foo.txt';
$data = file_get_contents($filePath);
$key = sha1($data); // or $key = sha1_file($filePath);
Save this $key in a table column, and mark that column as UNIQUE so that no two identical files can be stored.
Use sha1 instead of md5, since many version control systems like git use the sha1 hash itself to identify the uniqueness of a file.
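
To make the difference between the two calls concrete, here is a minimal sketch. The temporary file and its contents are made up for the example. sha1_file() and hash_file() read the file in chunks internally, so they avoid pulling a large upload into memory the way file_get_contents() does.

```php
<?php
// Hypothetical sample file, created only so the example is self-contained.
$filePath = tempnam(sys_get_temp_dir(), 'hash');
file_put_contents($filePath, 'hello');

// Hashing the contents as a string: the whole file is loaded into memory.
$fromString = sha1(file_get_contents($filePath));

// Hashing the file directly: PHP streams it, which is kinder to large files.
$fromFile = sha1_file($filePath);

// hash_file() does the same for any algorithm listed by hash_algos().
$sha256 = hash_file('sha256', $filePath);

echo $fromFile, PHP_EOL;   // same value as $fromString
unlink($filePath);
```

Either value can go into the UNIQUE column; the choice of algorithm only changes the column width (40 hex characters for sha1, 64 for sha256).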

When a file is uploaded:
Compute the hash (SHA1, etc.)
Rename the file to that hash and store it (unless a file with that hash already exists [you already have it])
Store the hash in your database.
When a file is requested:
Get the hash from your database
Return the file based on the hash
Use HTTP headers so the user's browser provides it to them with the filename they used
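
The steps above can be sketched roughly as follows. This is only an illustration: storeByHash(), dispositionHeader() and $storageDir are hypothetical names, and a real upload handler would use move_uploaded_file() on the entry from $_FILES rather than copy().

```php
<?php
// Store a file under its SHA-1 hash and return that hash.
// $storageDir is a hypothetical directory, ideally outside the document root.
function storeByHash(string $uploadedPath, string $storageDir): string
{
    $hash = sha1_file($uploadedPath);
    $target = $storageDir . '/' . $hash;
    if (!file_exists($target)) {
        // First time we see this content: keep one physical copy.
        copy($uploadedPath, $target);
    }
    return $hash;
}

// Build the header that makes the browser save the file under the user's name.
function dispositionHeader(string $userFilename): string
{
    // Strip characters that would break the quoted header value.
    $safe = str_replace(['"', "\r", "\n", '\\'], '', basename($userFilename));
    return 'Content-Disposition: attachment; filename="' . $safe . '"';
}

// When serving: look up $hash and $userFilename in your database, then
//   header(dispositionHeader($userFilename));
//   readfile($storageDir . '/' . $hash);
```

Because the stored name depends only on the content, a second upload of the same data under a different name lands on the existing file and costs no extra disk space.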

To get the md5 hash of a file at $path...
$hash = md5(file_get_contents($path));
Hope this partially answers your question.

There are many ways you can build such a system. But if I had to write one from scratch, this is most likely how I would do it:
Have three database tables (in pseudocode):
table users {
id integer ## PK
username string
password string ## sha1
...
}
table user_files {
user_id integer ## Composite INDEX
file_id integer ##
filename string
}
table files {
id integer ## PK
uniq_id string ## basically 'yyyyMMddhhmmssRRRR' INDEX
sha_hash string ## sha1
md5_hash string ## md5
}
Where files.sha_hash is the result of computing the sha1 of the file, files.md5_hash is the result of computing the md5 of the same file as a double security check, and user_files.filename is the actual file name. On the server, the file would be stored and renamed to files.uniq_id to make sure there is no name collision, where the last RRRR chars represent a random string (regenerate RRRR until uniq_id is unique in the database).
Note: PHP provides sha1_file and md5_file. Use these when hashing files!
When a user stores a file, process the file (as described above) and save it appropriately. To avoid having too many files in the same folder on the server, you may decompose files.uniq_id and separate the files into yyyy/MM subfolders.
Next, associate user_files.file_id = files.id and user_files.user_id = users.id, and set user_files.filename to the uploaded file name.
If a user uploads another file, process it the same way, but check whether the result matches an existing files.sha_hash and files.md5_hash. If there is a match, it doesn't matter what name the file has; it's very likely the exact same file, so reuse the found files.id for user_files.file_id, set user_files.user_id = users.id, and set user_files.filename to the uploaded file name.
Note: this leaves 1 physical file and 2 virtual files on your server.
If a user renames a file without modifying it, simply update the user_files.filename of the matching entry.
If a user deletes a file, check how many user_files entries reference that file_id; only if exactly 1 is found, delete the physical file and the files entry. In both cases, remove that user's user_files association.
If a user modifies a file, with or without renaming it, perform a delete followed by another save, as described above.
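
The save and delete rules can be sketched like this. To keep the example runnable without a database, two plain arrays stand in for the files and user_files tables; that substitution, and the function names saveFile() and deleteFile(), are assumptions for illustration only.

```php
<?php
// In-memory stand-ins for the files and user_files tables (assumption:
// a real system would run the equivalent queries against the database).
$files = [];      // file id => ['sha1' => ..., 'md5' => ...]
$userFiles = [];  // rows of ['user_id' => ..., 'file_id' => ..., 'filename' => ...]

// Save: reuse an existing files row when both hashes match.
function saveFile(array &$files, array &$userFiles, int $userId, string $path, string $name): int
{
    $sha1 = sha1_file($path);
    $md5 = md5_file($path);
    $fileId = null;
    foreach ($files as $id => $row) {
        if ($row['sha1'] === $sha1 && $row['md5'] === $md5) {
            $fileId = $id;
            break;
        }
    }
    if ($fileId === null) {
        $fileId = count($files) === 0 ? 1 : max(array_keys($files)) + 1;
        $files[$fileId] = ['sha1' => $sha1, 'md5' => $md5];
    }
    $userFiles[] = ['user_id' => $userId, 'file_id' => $fileId, 'filename' => $name];
    return $fileId;
}

// Delete: drop the association; report whether the physical file is now unused.
function deleteFile(array &$files, array &$userFiles, int $userId, int $fileId): bool
{
    foreach ($userFiles as $i => $row) {
        if ($row['user_id'] === $userId && $row['file_id'] === $fileId) {
            unset($userFiles[$i]);
            break;
        }
    }
    $remaining = array_filter($userFiles, fn($row) => $row['file_id'] === $fileId);
    if (count($remaining) === 0) {
        unset($files[$fileId]);   // the caller would also unlink() the stored file
        return true;
    }
    return false;
}
```

With a real database, the lookup in saveFile() becomes a SELECT on the two hash columns and the count in deleteFile() a COUNT(*) on user_files.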

You can use :
md5(file_get_contents($filename));
To generate a hash for a file.
Keep in mind that two entirely different files can produce the exact same md5 hash (the same is true of other hashes, although a stronger hash than md5 makes collisions far less likely). To be certain two files are identical you need to compare them byte by byte, but you don't want to analyze every byte of every file on the hard disk to find a match.
What you need to do is store the hash for every file in a database column, which should also be indexed.
Then you can select all files with the same hash as the new file from your database.
That will give you a small list of files. Say you have 100,000 files on the disc. You might get a list of a few files that match the hash. Most of the time the lists will be short. Then you can loop through those files byte by byte to see if it's a match. Searching through a list of the ~10 files that have the same hash will save you from searching through all 100,000 files, but you still need to do the byte by byte comparison, because those 10 files could all be very different.
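
A byte-by-byte check for the short list of candidates could look like the sketch below. The function name filesAreIdentical() and the 8 KiB chunk size are arbitrary choices for the example.

```php
<?php
// Returns true only if both files contain exactly the same bytes.
function filesAreIdentical(string $pathA, string $pathB): bool
{
    // Different sizes can never match, so skip reading entirely.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }
    $a = fopen($pathA, 'rb');
    $b = fopen($pathB, 'rb');
    $same = true;
    while (!feof($a)) {
        // Compare in 8 KiB chunks instead of loading whole files.
        if (fread($a, 8192) !== fread($b, 8192)) {
            $same = false;
            break;
        }
    }
    fclose($a);
    fclose($b);
    return $same;
}
```

You would call this only for the rows returned by the hash lookup, so the expensive comparison runs against a handful of files rather than the whole disk.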

Is it necessary? Hard disk space is very cheap these days, so who cares about the duplicates? I would imagine they are not that big?
MD5 et al. are not unique. They are just a quick way of telling that two files are not the same. It is possible for two files to have the same MD5 value but contain different data.

Related

sha1, crc32, and md5 how to read this data?

How can I decode the md5, crc32, and sha1 values? Below is the XML file, followed by the code I'm using to get the data so far.
<files>
<file name="AtTheInn-Germany-Morrow78Collection.mp3" source="original">
<format>VBR MP3</format>
<title>At the Inn - Germany - Morrow 78 collection</title>
<md5>056bbd63961450d9684ca54b35caed45</md5>
<creator>Germany</creator>
<album>Morrow 78 collection</album>
<mtime>1256879264</mtime>
<size>2165481</size>
<crc32>22bab6a</crc32>
<sha1>796fccc9b9dd9732612ee626c615050fd5d7483c</sha1>
<length>179.59</length>
</file>
And this is the code I'm using to get the title and album name. How can I make sense of the sha1 and md5? Any help in any direction will be appreciated. Thanks.
<?php
$search = $_GET['sku'];
$catalogfile = $_GET['file'];
$directory = "feeds/";
$xmlfile = $directory . $catalogfile;
$xml = simplexml_load_file($xmlfile);
list($product) = $xml->xpath("//file[crc32 = '$search']");
echo "<head>";
echo "<title>$product->title</title>";

MD5, SHA-1, and CRC32 are hash functions. That means that they cannot be reversed. [1] You'd have more luck looking into that name attribute of the file tag.
[1] You can [2] brute-force them, but since they can represent variable-length data as a fixed-length piece of data, due to the pigeonhole principle and just plain probability, you're more likely to get something that's not the original input than the original input.
[2] It'll take forever for SHA-1, though.

Hash functions generate numbers that represent some arbitrary data. They can be used to verify whether the data has changed (a good hash function should produce a totally different hash even if a single bit has changed).
Since you are turning an arbitrary amount of data into a fixed-size number, you lose information, which means it's hard to reverse them. Technically there is an infinite number of possible inputs for any given hash, as the data can be any length. Even for limited data sizes it's still possible for multiple data values to share a specific hash; this is called a collision.
For some data sets (for example passwords) you can generate all possible combinations of data and check to see if they match a hash. If you do the generation at the same time as the checking, it's known as 'brute forcing'. You can also store all possible combinations (for a limited range, for example all dictionary words or all combinations of characters under a specific size), then look the hash up. This is a precomputed lookup table (a rainbow table is a space-efficient form of it) and is useful for reversing multiple hashes.
It's good practice to store passwords as a hash rather than in plain text, but to ensure the passwords are hard to reverse you add a bit of random data to each one and store it along with the password hash; this is known as salting. The salt makes precomputed tables useless, so each password has to be brute-forced separately.
In this case they are probably hashes of the mp3 file that is specified, there to verify file integrity and show any corruption that occurs during transfer (or storage). It won't be possible to reverse them, since you would have to generate all possible combinations of megabytes of data. But if you have the file itself there wouldn't be any reason to. You can confirm they are hashes of the file by running a checksum-generating program on it.
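
If you do have the mp3 itself, checking it against the values in the feed could look like the following sketch. The function name verifyChecksums() is made up, and two things are assumptions about the feed: that its crc32 is the common zlib variant (PHP's 'crc32b'), and that it drops leading zeros, which the 7-character value 22bab6a suggests.

```php
<?php
// Compare a local file against the checksums published for one <file> entry.
// $expected mirrors the <md5>, <sha1> and <crc32> elements of that entry.
function verifyChecksums(string $path, array $expected): array
{
    $actual = [
        'md5'   => md5_file($path),
        'sha1'  => sha1_file($path),
        // 'crc32b' is the zlib/zip CRC-32 variant (assumed to match the feed).
        'crc32' => hash_file('crc32b', $path),
    ];
    $result = [];
    foreach ($expected as $name => $value) {
        // Left-pad with zeros in case the feed dropped them, then compare.
        $width = strlen($actual[$name]);
        $padded = str_pad(strtolower($value), $width, '0', STR_PAD_LEFT);
        $result[$name] = ($padded === $actual[$name]);
    }
    return $result;
}
```

Each entry in the returned array is true when the file matches the published value, so a single false points at a corrupted download or a wrong assumption about the algorithm.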

Image versioning - Best approach

I am developing a site framework in php (codeigniter) and want to introduce image versioning on image uploads so that I can take advantage of image caching. The easiest approach would just be to md5 the image and use that as the file name but I don't like this approach for the following reasons:
1) Not SEO friendly on the image names
2) md5 hashes seem unnecessarily long, and therefore a larger database field is required.
So I am considering using an approach such as the following:
Start the filename with the entered name of the image with underscores instead of spaces then add a randomly generated integer, say 8 digits long. This will mean I have to check for an existing image by that name and then regenerate the integer if one exists (however unlikely that is).
Presumably I will also have to use a unique filename for every image size, so I guess the solution here would be to add a prefix representing the size.
Now I want to get this right first time since it will be a pain to change once the framework is deployed so I am really just looking for input on
a) Whether my concerns are justified (particularly, does the filename do anything for SEO and does the length of a random string of numbers affect it)
b) Whether there is anything else I should be concerned about or check for with my proposed approach.
c) Is there an easier approach, perhaps a hashing algorithm which produces much shorter results.
d) Is there already a CI lib out there that does this?
Thank you for your input and advice

This answers a few of your questions:
Replacing spaces with underscores is not enough to have a clean filename as you'd need to check for more weird characters, but you can use sanitize_filename() method in CI's security library: http://ellislab.com/codeigniter/user-guide/libraries/security.html
If you do want to preserve the original filename, your approach sounds good to me. Though, the 8-digit integer at the end of the filename can be replaced by '-1', '-2', '-3' using a simple incremental loop that checks whether a file with that ending exists.
The File Upload library is something you can check out - http://ellislab.com/codeigniter/user-guide/libraries/file_uploading.html. It is flexible and can be configured to keep the original filenames. Getting sanitize_filename() from the Security lib to work along with it should do exactly what you need.
In all my CI applications I always use an encrypted filename (this optional feature is provided by the CI file upload class). At the same time I can configure the library not to overwrite an already existing file, by adding a number to the name (if no encryption is used) or by just giving it another encrypted name (when the encryption option is on). I like it this way as it keeps the filenames consistently clean (although long and not SEO-friendly; however, the ALT tag gives the image more exposure to search engines).

How do you convert crypt() hashes for use as file names?

I'm using crypt(), which in this particular case uses an md5 hash with a 12 character salt.
Here is an example of the string crypt() returns, modified from the crypt documentation on php.net.
$1$rasmusle$rISCgZzpwk3UhDidwX/in0
Here is the salt, which also includes the encoding type.
$1$rasmusle$
Here is the encoding type (MD5 in this case).
$1$
And finally the hash value.
rISCgZzpwk3UhDidwX/in0
You cannot have forward slashes in file names, as they will be interpreted as a folder separator.
Should I simply remove all the forward slashes, and are there other issues with the character set that crypt() uses?

It looks like you want to prevent / allow access to the image for specific users. If that is the case I would do the following:
Store the images outside of the document root. This makes sure the images cannot simply be directly requested.
Store the image's original name in the database and also store the sha1_file() hash in the same record. This adds the benefit of not having duplicate images on your server. Although images are small, it prevents cluttering of the system.
When somebody requests a "private" image they will request it through a PHP file which will check whether the user has the privileges to access the file and if so serves the file (from the database).
With the above method you will have the most control over who can request the images and your users will thank you for that.
Note that you cannot simply store all images in the same folder, because all filesystems have limits as to how many files can be stored in a single directory.
A simple example of a PHP script that serves an image would look something like the following:
<?php
// always set the header and change it according to the type of the image
header("Content-type: image/jpeg");
echo file_get_contents('/path/to/the/image.jpg');

$1$ identifies the algorithm that was used to create the hash.
You can just use the md5 / md5_file or sha1 / sha1_file functions, which create a hash without that additional information, unless you want to use different algorithms at the same time.

Run urlencode() over your hash, and it will replace every '/' with %2F. I know this isn't a perfect fix, because I think servers like Apache still block web requests with '%2F' in the URL by default. Just my 2 cents on the matter.

ALWAYS normalize user-provided data, including file names, unless you want to be hacked by someone uploading a file whose name contains a NUL byte to fool PHP. Specify the allowed characters (e.g. A-Za-z0-9) and convert all others to, e.g., an underscore. Or use sha1/md5 to create a hash from the filename and store the file under that name.
EDIT
This will replace all characters except for A-Z, a-z, 0-9 with underscore _:
$normalizedName = preg_replace('/[^A-Za-z0-9]/', '_', $userProvidedName);

generating random, single use URLs

I'm creating a pretty simple application which allows references (of potential employees) to upload their own reference letters. Here's how it works:
The applicant submits the reference's email
The reference receives an automatically generated email containing a unique URL (for security reasons)
Reference follows the link, answers security question (in case he wishes to access the site more than once)
Uploads letter
I'm stuck on how to generate a completely (okay, okay, quasi will do) random URL. What's more: how do I ensure that following the link will direct the references to the correct page? Do I have to create a new page containing the drop box every time I send out a random URL?
Thanks for any suggestions on how to go about this :)

You might try using a hash algorithm which generates the unique-esque checksum from the contents of the file. Usually (for example with md5()) one byte change in the original content results in a completely different hash. (Notice: md5 has some collision vulnerabilities.)
If you store the uploaded file with the filename of the hash, you will be able to retrieve it at a later date, but for more complex system, there should be a database set up which makes the relation between the random URL and the stored content.
If you don't want to hash, the code snippet below could help to generate random URLs (but make sure that if a URL is already used, you prevent accidental overwrites):
md5( sha1( time() + rand(0, time()) ) );

I presume that when you say that you want to generate a random URL, you are essentially asking to generate a random string. This is potentially very simple; here's some pseudocode:
for i = 1 to stringLength
randomString[i] = floor(random() * 26) + 'a'
end
In other words, generate a random number between 0 and 25, and add it to the ASCII value for the character 'a'. This would generate a random string of lowercase letters, which I think should be sufficient for your task. In PHP, you would use the rand function. It would be advisable to use the srand function to seed the random number generator with the current time, like the example at the end of the given link.
As for the second part; I recommend that you simplify things; rather than generating an actual page with a random URL, why not simply pass a random string into the query string such as:
www.mydomain.com/uploadReference.php?id=xxxxxxx
Where xxxxxxx is your random string. You can then verify the string and look it up in a database using PHP. This seems, for your purposes, by far the easiest way.
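
As a sketch of that idea in PHP 7 or later, random_bytes() gives an unguessable token without any manual seeding. The function names are made up for the example, the host and script name are the placeholders from above, and the https:// scheme is an assumption.

```php
<?php
// Generate an unguessable token; 16 random bytes become 32 hex characters.
function generateToken(int $bytes = 16): string
{
    return bin2hex(random_bytes($bytes));
}

// Build the link to email to the reference. The host and script name are
// the placeholders used in the answer, not a real endpoint.
function buildUploadUrl(string $token): string
{
    return 'https://www.mydomain.com/uploadReference.php?'
        . http_build_query(['id' => $token]);
}

$token = generateToken();
// Store $token alongside the reference's record, then mail the URL.
echo buildUploadUrl($token), PHP_EOL;
```

The upload script then looks the id up in the database and rejects anything it does not find, so a single page serves every reference.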

You can make a unique string based on some form of hash of the current time stamp or the reference's unique credentials (e.g. username or something). You could then create one page for the dropbox that would accept that unique string in the URL to be used for a script on the page which would retrieve the relevant data mapped by that string in a database.

You could also create a random permutation of numerals and characters so that
hash($previous) // is unique
The basic idea is that the 'hash' function depends only on the previous value, creating a new unique value. For example, so that '0' -> '1', '1' -> '2', '9' -> 'a', 'z' -> '10', '10' -> '11'. Such an algorithm is relatively easy to devise.
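
One way to devise it is to treat the string as a base-36 number and add one. This is a minimal sketch with a made-up function name; note that the values it produces are unique but sequential, so they are easy to guess.

```php
<?php
// Return the successor of a base-36 identifier ('0'-'9', then 'a'-'z').
function nextId(string $previous): string
{
    // intval() with base 36 parses the string; this works as long as the
    // value fits in a PHP integer.
    $value = intval($previous, 36) + 1;
    return base_convert((string) $value, 10, 36);
}

echo nextId('9'), PHP_EOL;
echo nextId('z'), PHP_EOL;
```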

encrypt / decrypt record id to 7 char in php

I am uploading files to my server and naming them with the record ID. Because the record IDs are sequential, these files are not safe: they can be downloaded in a loop. http://www.blabla.com/1.jpg .. 2.jpg etc.
I want to encrypt the record id to 7 chars, and when reading these files I want to decrypt it back.
So file names would be
http://www.blabla.com/72ayhg6.jpg
where 72ayhg6, when decrypted, is id 1.
How can I do this using PHP?
PHP's encrypt and decrypt functions generate quite a long value. Can I add some sort of salt to it and limit it to 7 or 11 chars?
thanks in advance.

Check this http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/

Why do you need to decrypt it?
Did you bother to read up on anti-leeching strategies?
A simple approach (though far from ideal, since its just security by obscurity) would be to rename the files based on a hash of the record id and a nonce.
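
A rough sketch of that approach is below. The function name obscureName() and the $secret value are hypothetical. The result cannot be decrypted back to the id, so you would store the generated name in the record and look files up by it; and because 7 hex characters only carry 28 bits, you should check each new name for uniqueness before using it.

```php
<?php
// Derive a short, non-sequential name from a record id and a secret.
// $secret is a hypothetical server-side value; keep it out of the web root.
function obscureName(int $recordId, string $secret, int $length = 7): string
{
    // The HMAC ties the output to the secret, so names cannot be enumerated
    // just by counting through the ids.
    return substr(hash_hmac('sha256', (string) $recordId, $secret), 0, $length);
}

echo obscureName(1, 'change-me'), '.jpg', PHP_EOL;
```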
Can I added some sort of salt in it and limited it to 7 or 11 char linux file name limit.
If you don't know the difference between Linux and MS-DOS, perhaps you need to cover some of the basics before attempting to write code?
