Programmatically changing the hash of a file without corrupting it - PHP

Does anyone have any info on changing the hash of a file without corrupting it?
I read about appending a null byte to the end of the file, which changes the MD5 without corrupting the file. Can anyone confirm this or suggest other approaches?
The language I wish to do this in is PHP.
Thanks.

It depends on exactly what the applications expect when they read this file. If, for example, it's a text file, you could simply insert a space following one of the paragraphs. This doesn't change the readability of the file by humans but it will change the MD5.
Likewise for basic HTML files or source files such as C or PHP where spacing doesn't matter (as long as you insert the space in a syntactically insignificant spot, so not inside string constants, for example). Put in some extra spaces or add newline characters at the end and you'll find the behavior of your web pages doesn't change.
However this is unlikely to work for an executable file since it will probably crash and burn when you run it (if indeed it even loads - some loaders may use checksums for the load sections).
You need to specify exactly what corruption means in the case you're talking about.
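For the plain-text/HTML case a one-liner is enough; a small sketch (the 'page.html' path is just a placeholder):

// Appending a newline changes the MD5 but not how the text/HTML is rendered.
file_put_contents('page.html', "\n", FILE_APPEND);
echo md5_file('page.html');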
Update:
For example, in JPEG files it's probably a simple matter of replacing the EOI marker at the end with a unique COM segment followed by an EOI marker. The EOI marker is the end-of-image marker, and you should be able to insert a comment segment (with a unique comment) just before it. This would give each JPEG a different MD5 while still presenting the same image. See here.
With ZIP files, you can actually insert arbitrary data in between each file since the catalog at the end lists files with their offsets. See here for details. Unfortunately, I'm not familiar with the internals of RAR files.
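A rough PHP sketch of the JPEG idea (assuming a well-formed JPEG that actually ends with the EOI marker; treat it as a starting point, not production code):

// Insert a unique COM (comment) segment just before the trailing EOI marker
// (0xFF 0xD9). Decoders ignore comments, so the image displays the same but
// its MD5 changes.
function touch_jpeg_md5(string $path): bool
{
    $data = file_get_contents($path);
    if ($data === false || substr($data, -2) !== "\xFF\xD9") {
        return false;                               // not a JPEG ending in EOI, leave it alone
    }
    $comment = uniqid('md5-tweak-', true);          // unique payload
    $segment = "\xFF\xFE"                           // COM marker
             . pack('n', strlen($comment) + 2)      // length field includes its own 2 bytes
             . $comment;
    $data = substr($data, 0, -2) . $segment . "\xFF\xD9";
    return file_put_contents($path, $data) !== false;
}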

Sounds like you might be better off just changing those duplicate files to symbolic links ln -s otherfolder/file file (assuming the server is on a *nix platform).

If you are primarily dealing with .ZIP and .RAR files, find a ZIP/RAR library for PHP, and simply add a tiny random file to every zip/rar.
For JPEGs, follow paxdiablo's answer.
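For ZIPs, a minimal sketch with PHP's built-in ZipArchive (the archive path is a placeholder); setArchiveComment() with a random comment would work just as well:

// Add a tiny, uniquely named entry so the archive's MD5 changes while the
// existing entries stay untouched.
$zip = new ZipArchive();
if ($zip->open('archive.zip') === true) {    // placeholder path
    $zip->addFromString('.md5-tweak-' . uniqid(), (string) mt_rand());
    $zip->close();
}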

Related

Store images in directories and store a reference to the images in the database

I am currently involved in a project to create a website which allows users to share and rate images for their creative media course.
I am trying to find ways to save images to a MySQL database. I know I can save images as BLOBs, but this won't work as I plan on only allowing users to save high-res images. Therefore, I've tried to find out how to store images in a directory/server folder and store references to the images in the database. An added complication to the matter is that the reference must automatically be saved within a MySQL database table.
Does anyone know how to go about this, or can you point me in the right direction?
Thanks!
I've actually built a similar website (mass image uploader) so I can speak from experience.
Keeping track of the files
Save the image file as-is on disk and save the path to the file in the database. This part should be pretty straightforward.
One disadvantage is that you need a database lookup for every image, but if your table is well optimized (indexes) this should be no real problem.
There are many advantages, such as your files become easily referable and you can add meta data to your files (like number of views).
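A minimal sketch of that flow, assuming a PDO connection and an images table with filename, path and views columns (all of these names are made up for illustration):

// Save the uploaded file under a generated name and record its relative path.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$original = $_FILES['image']['name'];
$relative = 'uploads/' . md5(uniqid('', true)) . '.jpg';
move_uploaded_file($_FILES['image']['tmp_name'], __DIR__ . '/' . $relative);

$stmt = $pdo->prepare('INSERT INTO images (filename, path, views) VALUES (?, ?, 0)');
$stmt->execute([$original, $relative]);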
Filenames
Now, saving files, lots of files, is not immediately straightforward.
If you don't care at all about filenames just generate a random hash like:
$filename = md5(uniqid()); // generate a random hash, mileage may vary
This gets rid of all kinds of filename-related issues like duplicate filenames, unsupported characters, etc.
If you want to preserve the filename, store the filename in the database.
If you want your filename on disk to also be somewhat human readable I would go for a mixed approach: partly hash, partly original filename. You will need to filter unsupported characters (like /), and perhaps transliterate similar characters (like é -> e and ß -> ss). Foreign languages such as Chinese and Hebrew can give interesting results, so be aware of that. You could also encode any foreign character (like base64_encode) but that doesn't do much for readability.
Finally, be aware of file path length constraints. Filenames and file paths cannot be infinitely long; I believe Windows limits the full path to roughly 260 characters (MAX_PATH).
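A sketch of the mixed approach (short hash prefix plus a sanitized original name); iconv's //TRANSLIT behaviour varies between systems, so results for non-Latin scripts will differ:

// Build a disk filename that is partly hash, partly readable original name.
function disk_filename(string $original): string
{
    $base  = pathinfo($original, PATHINFO_FILENAME);
    $ext   = strtolower(pathinfo($original, PATHINFO_EXTENSION));
    $ascii = (string) iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $base);   // é -> e, ß -> ss (platform dependent)
    $safe  = trim(preg_replace('/[^A-Za-z0-9._-]+/', '-', $ascii), '-') ?: 'file';
    return substr(md5(uniqid('', true)), 0, 8) . '-' . substr($safe, 0, 60)
         . ($ext !== '' ? '.' . $ext : '');
}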
Buckets
You should definitely consider using buckets because OSes (and humans) don't like folders with thousands of files.
If you're using hashes you already have a convenient bucket scheme available.
If your hash is 0aa1ea9a5a04b78d4581dd6d17742627
Your bucket(s) can be: 0/a/a/1/e/a9a5a04b78d4581dd6d17742627. In this case there are 5 nested buckets, which means you can expect roughly one file per bucket once you have 16^5 (~1 million) files. How many levels of buckets you need is up to you.
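A quick sketch of turning such a hash into a bucket path:

// Split the first $levels hex characters of the hash into nested directories.
function bucket_path(string $hash, int $levels = 5): string
{
    return implode('/', str_split(substr($hash, 0, $levels))) . '/' . substr($hash, $levels);
}

echo bucket_path('0aa1ea9a5a04b78d4581dd6d17742627');
// 0/a/a/1/e/a9a5a04b78d4581dd6d17742627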
Mime-type
It's also good to keep track of the original file extension / mime-type. If you only have one kind of mime-type (like TIFF) then you don't need to worry about it. Most formats have a signature that makes them easy to detect (PNG files, for instance, start with the letters "PNG"; open one in a text editor to see it), but you don't want to have to rely on that.
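To detect the mime type from the file contents rather than trusting the extension, PHP's Fileinfo extension does the job ($path_on_disk is a placeholder):

// Sniff the mime type from the file's magic bytes.
$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime  = $finfo->file($path_on_disk);    // e.g. "image/png" or "image/tiff"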
Relative path vs absolute path
I would also recommend saving the relative path to the files, not the absolute path. This makes maintenance much easier.
So save:
0/a/a/1/e/a9a5a04b78d4581dd6d17742627
instead of:
/var/www/wwwdata/images/0/a/a/1/e/a9a5a04b78d4581dd6d17742627

PHP File upload security - keeping the original file name

I want to allow registered users of a website (PHP) to upload files (documents), which are going to be publicly available for download.
In this context, is the fact that I keep the file's original name a vulnerability ?
If it is one, I would like to know why, and how to get rid of it.
While this is an old question, it's surprisingly high on the list of search results when looking for 'security file names', so I'd like to expand on the existing answers:
Yes, it's almost surely a vulnerability.
There are several possible problems you might encounter if you try to store a file using its original filename:
the filename could be a reserved or special file name. What happens if a user uploads a file called .htaccess that tells the webserver to parse all .gif files as PHP, then uploads a .gif file with a GIF comment of <?php /* ... */ ?>?
the filename could contain ../. What happens if a user uploads a file with the 'name' ../../../../../etc/cron.d/foo? (This particular example should be caught by system permissions, but do you know all locations that your system reads configuration files from?)
if the user the web server runs as (let's call it www-data) is misconfigured and has a shell, how about ../../../../../home/www-data/.ssh/authorized_keys? (Again, this particular example should be guarded against by SSH itself (and possibly the folder not existing), since the authorized_keys file needs very particular file permissions; but if your system is set up to give restrictive file permissions by default (tricky!), then that won't be the problem.)
the filename could contain a NUL (0x00) byte, or control characters. System programs may not respond to these as expected - e.g. a simple ls -al | cat (not that I know why you'd want to execute that, but a more complex script might contain a sequence that ultimately boils down to this) might execute commands.
the filename could end in .php and be executed once someone tries to download the file. (Don't try blacklisting extensions.)
The way to handle this is to roll the filenames yourself (e.g. md5() of the file contents or the original filename). If you absolutely must allow the original filename, then to the best of your ability whitelist the file extension, check the file's mime type, and whitelist which characters can be used in the filename.
Alternatively, you can roll the filename yourself when you store the file and for use in the URL that people use to download the file (although if this is a file-serving script, you should avoid letting people specify filenames here, anyway, so no one downloads your ../../../../../etc/passwd or other files of interest), but keep the original filename stored in the database for display somewhere. In this case, you only have SQL injection and XSS to worry about, which is ground that the other answers have already covered.
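A hedged sketch of that approach (the extension/mime whitelist and the upload field name 'doc' are assumptions; adjust them to whatever you actually accept):

// Whitelist the extension, verify the mime type, and store under a content hash.
$allowed = ['pdf' => 'application/pdf', 'png' => 'image/png', 'jpg' => 'image/jpeg'];

$ext  = strtolower(pathinfo($_FILES['doc']['name'], PATHINFO_EXTENSION));
$mime = (new finfo(FILEINFO_MIME_TYPE))->file($_FILES['doc']['tmp_name']);

if (isset($allowed[$ext]) && $allowed[$ext] === $mime) {
    $stored = md5_file($_FILES['doc']['tmp_name']) . '.' . $ext;
    move_uploaded_file($_FILES['doc']['tmp_name'], '/srv/uploads/' . $stored);
    // keep $_FILES['doc']['name'] in the database for display, HTML-escaped on output
}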
That depends where you store the filename. If you store the name in a database, in strictly typed variable, then HTML encode before you display it on a web page, there won't be any issues.
The name of the files could reveal potentially sensitive information. Some companies/people use different naming conventions for documents, so you might end up with:
Author name ( court-order-john.smith.doc )
Company name ( sensitive-information-enterprisename.doc )
File creation date ( letter.2012-03-29.pdf )
I think you get the point, you can probably think of some other information people use in their filenames.
Depending on what your site is about this could become an issue (consider if WikiLeaks published leaked documents that had the original source somewhere inside the filename).
If you decide to hide the filename, you must consider the problem of somebody submitting an executable as a document, and how you make sure people know what they are downloading.

Fast access to files

I'm currently building an application that will generate a large number of images (a few tens of thousands of images, possibly more, but not in the near future at least). I want to be able to determine whether a file exists or not and also send it to clients over HTTP (I'm using Apache as my web server).
What is the best way to do this? I thought about splitting the images across a few folders to reduce the number of files in each directory. For example, let's say that I decide each file name will begin with a lowercase letter of the alphabet. Then I create 26 directories, and when I want to look for a file I prepend the name of the directory. For example, if I want a file called "funnyimage2.jpg" I will save it inside a directory called "f". I can add layers to that structure if required.
To be honest I'm not even sure if just saving all the files in one directory isn't just as good, so if you could add an explanation as to why your solution is better it would be very helpful.
p.s
My application is written in PHP and I intend to use file_exists to check if a file exists or not.
Do it with a hash, such as md5 or sha1 and then use 2 characters for each segment of the path. If you go 4 levels deep you'll always be good:
f4/a7/b4/66/funnyimage.jpg
Oh, and the reason it's slow to dump it all in one directory is that most filesystems don't store filenames in a B-tree or similar structure; they often have to scan the entire directory to find a file.
The reason a hash is great is that it has really good distribution. 26 directories may not cut it, especially if lots of images have filenames like "image0001.jpg".
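A sketch of that layout in PHP (the base directory is a placeholder and the example path is illustrative, not the real md5 of that name):

// Build a 4-level path from the first 8 hex characters of an md5 of the filename.
function image_path(string $filename): string
{
    $hash = md5($filename);
    return implode('/', str_split(substr($hash, 0, 8), 2)) . '/' . $filename;   // e.g. f4/a7/b4/66/funnyimage2.jpg
}

// The file_exists() check then becomes:
$exists = file_exists('/var/www/images/' . image_path('funnyimage2.jpg'));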
Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Consequently, ext3 lacks recent features, such as extents, dynamic allocation of inodes, and block suballocation.[15] A directory can have at most 31998 subdirectories, because an inode can have at most 32000 links.[16]
A directory on a unix file system is just a file that lists filenames and what inode contains the actual file data. As such, scanning a directory for a particular filename boils down to the equivalent operation of opening a text file and scanning for a line with a particular piece of text.
At some point, the overhead of opening that directory "file" and scanning for your filename will outweigh the overhead of using multiple sub-directories. Generally, this won't happen until there's many thousands of files. You should benchmark your system/server to find where the crossover point is.
After that, it's a simple matter of deciding how to split your filenames into subdirectories. If you're allowing only alpha-numeric characters, then maybe a split based on the first 2 characters (1,296 possible subdirs) might make more sense than a single dir with 10,000 files.
Of course, for every additional level of splitting you add, you're forcing the system to open yet another directory "file" and scan for your filename, so don't go too deep on the splits.
Your setup is okay. Keep going this way.
It seems that you are on the right path. Another post at ServerFault seems to confirm that you are doing the right thing.
I think Linux has a limit on the number of files a directory can contain; it might be best to split them up.
With your method, you can have the exact same image under many different file names. Also, you'll have more images that start with "t" than with "q", so that directory would still get large. You might want to store them as MD5-HASH.jpg instead. This will eliminate duplicates and give a more even distribution over the 16 possible first-character directories (an MD5 hex digest only uses 0-9 and a-f).
Edit: Like Evert mentions, you can do a multi-level directory structure to keep the directory size even smaller.

List image files using PHP -- and be case-sensitive

A drop-box directory for image files has collected variants by letter-case, for example:
Bonsai.jpg, BONSAI.jpg, Bonsai.JPG, bonsai.jpg
I am making a web app using CodeIgniter to manage these documents on a remote server. This means using
file_exists() or is_file() to verify a file's presence
an HTML img tag to display the file graphically
But both these tools use the first match they find, regardless of case. How can I deal with this?
(I noticed a similar question to this one, but for Delphi instead of PHP.)
But both these tools use the first match they find, regardless of case
They definitely shouldn't - at least not on a file system that is case sensitive, like Linux's default file system (is it still called ext2?). While it's questionable practice IMO to have those four files in the same directory, neither file_exists() nor the serving of web resources should show the behaviour you describe.
It's different on Windows: FAT and NTFS are not case sensitive. In your example, only one of the four files you mention can exist in the same directory.
When accepting images I always rename them, for example using CI's encrypt_name option of the File Upload class, to avoid these kinds of problems. Otherwise it can turn into a big headache.
EDIT: added my comment on the OP below
You can easily write a script that puts all filenames into an array, identifies duplicates, and appends _1 to their names, so that you have only unique filenames. Then you convert them all to lowercase. For all existing files and new ones, you encrypt the filenames to a 32-character string. Batch processing filenames like this is actually quite easy. Just keep a backup of all the files just in case, and very little can go wrong.
CodeIgniter has some useful functions, like the file helper's get_filenames(), which puts all files in a specified directory into an array, and the security helper's do_hash(), which would hash the filenames. For future uploads, set the encrypt_name preference to TRUE.
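A plain-PHP sketch of that clean-up (inside CodeIgniter you'd use get_filenames() instead of scandir(); the directory path is a placeholder, and you should run it against a backup first):

// Lowercase every filename, appending _1, _2, ... to case-variant duplicates.
$dir  = '/path/to/dropbox/';
$seen = [];
foreach (scandir($dir) as $file) {
    if ($file === '.' || $file === '..') continue;
    $lower = strtolower($file);
    if (isset($seen[$lower])) {                                   // case-variant duplicate
        $info  = pathinfo($lower);
        $lower = $info['filename'] . '_' . $seen[$lower]++ . '.' . ($info['extension'] ?? '');
    } else {
        $seen[$lower] = 1;
    }
    if ($lower !== $file) {
        rename($dir . $file, $dir . $lower);
    }
}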

Generating ZIP files with PHP + Apache on-the-fly in high speed?

To quote some famous words:
“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”
While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)
Suppose you have this big collection with JPG files and a few odd SWF files. With "big" I mean "a couple thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there's a few new JPG files. The total size of all the stuff is thus around 1 GB, and is slowly but steadily increasing. Files are VERY rarely changed or deleted.
The users can view each of the files individually on the webpage. However there is also the wish to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.
The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.
Since the number of possible filter combinations is large, I cannot pre-generate all the possible ZIP files and must do it on the fly. Another problem is that the download can be quite large, and for users with slow connections it's quite likely that it will take an hour or more. Support for "resume" is therefore a must-have.
On the bright side however the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.
The problems then that I have identified are thus:
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Will passing large amounts of file data through PHP not be a performance hit in itself?
How would you implement this? Is PHP up to the task at all?
Added:
By now two people have suggested to store the requested ZIP files in a temporary folder and serving them from there as usual files. While this is indeed an obvious solution, there are several practical considerations which make this infeasible.
The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreds of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also there are many possible filter combinations and many of them are likely to be selected by the users.
As a result, the ZIP files will be pretty slow to generate (due to sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.
This may be what you need:
http://pablotron.org/software/zipstream-php/
This lib allows you to build a dynamic streaming zip file without swapping to disk.
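A minimal usage sketch, assuming a ZipStream-PHP v2-style API (the constructor and method names have changed between versions, so check the docs for the release you install); $matching_files stands in for the result of your filter query:

require 'vendor/autoload.php';

// Streams the archive straight to the client, no temp file on disk.
$zip = new ZipStream\ZipStream('photos.zip');
foreach ($matching_files as $path) {
    $zip->addFileFromPath(basename($path), $path);
}
$zip->finish();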
Use e.g. the PhpConcept Library Zip library.
Resuming has to be supported by your web server, except in the case where you don't make the zip files directly accessible. If a PHP script acts as the mediator, pay attention to sending the right headers to support resuming.
The script creating the files should never time out; just make sure users can't select thousands of files at once. Also keep something in place to remove old zip files, and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
You're going to have to store the generated zip file, if you want them to be able to resume downloads.
Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (hash of the search filters maybe). Then you send the correct headers to the user and echo file_get_contents to the user.
To support resuming you need to check the $_SERVER['HTTP_RANGE'] value; its format is detailed here, and once you've parsed that you'll need to run something like this.
$size = filesize($zip_file);
if (isset($_SERVER['HTTP_RANGE'])) {
    // parse "bytes=start-end" out of the Range header
    list(, $seek_range) = explode('=', $_SERVER['HTTP_RANGE'], 2);
    $range = explode('-', $seek_range);
    $start = (int) $range[0];
    $end   = ($range[1] !== '') ? (int) $range[1] : $size - 1;
    $new_length = $end - $start + 1;
    header("HTTP/1.1 206 Partial Content");
    header("Content-Length: $new_length");
    header("Content-Range: bytes $start-$end/$size");
    echo file_get_contents($zip_file, false, null, $start, $new_length);
} else {
    header("Content-Length: " . $size);
    echo file_get_contents($zip_file);
}
This is very sketchy code; you'll probably need to play around with the headers and the parsing of the HTTP_RANGE value a bit. You can use fopen, fseek and fread rather than file_get_contents if you wish, seeking straight to the right place.
Now to your questions
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
You can remove it if you want to. However, if something goes pear-shaped and your code gets stuck in an infinite loop, that can lead to interesting problems - say that infinite loop is logging an error somewhere and you don't notice until a rather grumpy sys-admin wonders why their server ran out of hard disk space ;)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Caching the file to the hard disk means you won't have this problem.
Will passing large amounts of file data through PHP not be a performance hit in itself?
Yes, it won't be as fast as a regular download from the web server, but it shouldn't be too slow.
I have a download page and wrote a zip class that is very similar to your ideas.
My downloads are very big files that can't be zipped properly with the zip classes out there,
and I had similar ideas to yours.
The approach of giving up compression is very good: you need far fewer CPU resources, you save memory because you don't have to touch the input files and can pass them straight through, you can easily calculate everything like the zip headers and the final file size, and you can jump to any position and generate from that point to support resuming.
I go even further: I generate one checksum from all the input files' CRCs and use it as an ETag for the generated file to support caching, and as part of the filename.
If you have already downloaded the generated zip file, the browser gets it from the local cache instead of the server.
You can also throttle the download rate (for example to 300 KB/s).
You can add zip comments.
You can choose which files get added and which don't (for example Thumbs.db).
But there's one problem that you can't completely overcome with the zip format:
the generation of the CRC values.
Even if you use hash_file() to get around the memory problem, or hash_update() to generate the CRC incrementally, it will use too much CPU.
Not much for one person, but it's not recommended for professional use.
I solved this with an extra CRC value table that I generate with a separate script.
I pass these CRC values to the zip class as a parameter.
With this, the class is ultra fast.
Like a regular download script, as you mentioned.
My zip class is a work in progress; you can have a look at it here: http://www.ranma.tv/zip-class.txt
I hope it can help someone :)
But I will discontinue this approach; I will rewrite my class as a tar class.
With tar I don't need to generate CRC values from the files; tar only needs some checksums for the headers, that's all.
And I don't need an extra MySQL table any more.
I think it makes the class easier to use if you don't have to create an extra CRC table for it.
It's not that hard, because tar's file structure is simpler than the zip structure.
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
If your script is safe and it stops on user abort, then you can remove it completely.
But it would be safer if you just renew the timeout for every file that you pass through :)
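A small sketch of renewing the timeout per file while streaming (plain readfile() stands in here for writing the next zip entry):

// Reset the execution timer for each file and stop if the client disconnects.
foreach ($files as $path) {
    set_time_limit(30);          // renew the per-script timeout for every file
    readfile($path);             // placeholder for writing the next zip entry
    flush();
    if (connection_aborted()) {
        exit;
    }
}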
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Yes, that would work.
I generated a checksum from the input files' CRCs.
I used it as an ETag and as part of the zip filename.
If something changes, the user can't resume the generated zip,
because the ETag and filename change together with the content.
Will passing large amounts of file data through PHP not be a performance hit in itself?
No, if you only pass the data through it won't use much more than a regular download.
Maybe 0.01%, I don't know; it's not much :)
I assume that's because PHP doesn't do much with the data :)
You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided into chunks, instead of loading the entire content in PHP and then sending the zip file.
Both libraries are nice and useful pieces of code. A few details:
ZipStream "works" only with memory, but cannot be easily ported to PHP 4 if necessary (uses hash_file())
PHPZip writes temporary files on disk (consumes as much disk space as the biggest file to add in the zip), but can be easily adapted for PHP 4 if necessary.
