PHP free file content analyzer library / function

I was wondering if you know of any good, accurate PHP library or file I can include in my script to analyse the content of a file and check whether it is a specific type such as .doc, .docx, .jpg, etc.
I know PHP offers a large number of libraries we could use for this, but they aren't very accurate: some only check the file extension or the file header (they can't even tell whether the file is broken).
What I'm asking for is something very accurate, simple and fast (I'm probably asking too much), but any link or suggestion will be appreciated. Thank you!

As far as I know, no such library exists; it also wouldn't make sense to have one.
Let's say I have a JPEG image I would like to analyse: the headers would probably be okay but the image itself is broken, and when I want to convert it or crop it for thumbnails (with the GD library, which is the one I use), the functions (mostly imagecreatefromjpeg) throw errors, and in order to create a good thumbnail I need a valid image.
The best place to catch a malformed JPG file with valid headers is when GD errors out while trying to process it. Just deal with that in a transparent and useful way (= let the user know that something went wrong). Why add extra code that would essentially have to do the same thing?
By handling the error when it occurs, you can also catch issues that a simple analysis of the file wouldn't reveal anyway - for example, GD can't deal with CMYK JPGs. Still, CMYK JPGs are perfectly valid files. Another example is files that are too big to be processed on your server.
Of course, you can do header or size checks beforehand on every uploaded file. But a separate check that goes as deeply as you want it doesn't make sense.
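The "handle the error when GD fails" approach can be sketched roughly as below. This is only an illustration under some assumptions: createThumbnail() is a hypothetical helper name, the GD extension must be installed, and imagecreatefromstring() is used in place of imagecreatefromjpeg() so that GD detects the format itself.

```php
<?php
// Hypothetical helper (not from the thread): returns a GD image, or null
// when the file cannot be read or GD cannot decode it. Requires ext-gd.
function createThumbnail(string $sourcePath, int $maxWidth = 200)
{
    $bytes = @file_get_contents($sourcePath);
    if ($bytes === false) {
        return null; // unreadable or missing file
    }

    // imagecreatefromstring() returns false (plus a warning, suppressed
    // here) when the data is not an image GD can decode.
    $source = @imagecreatefromstring($bytes);
    if ($source === false) {
        return null;
    }

    // With only a width given, imagescale() keeps the aspect ratio.
    $thumb = imagescale($source, min($maxWidth, imagesx($source)));

    return $thumb === false ? null : $thumb;
}
```

The caller only has to test for null and show the user a useful message; no separate "deep" validation pass is needed.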
Apart from that, I would like to have it to prevent virus or code injection.
This isn't a realistic goal. What if the library you open the file with to check it is vulnerable to the injection?
Also, injections like this are very rare; library vulnerabilities tend to be widely publicized, and patches quickly provided. Just keep your machine up to date.
If you really need enterprise-grade virus protection, get a server-side virus detection product.

What I did for this was to open the file, read it, and search for the file headers. Most of them are available in the format definition on Wikipedia.
%PDF for PDF, first 4 bytes.
\x89PNG for PNG, first 4 bytes (byte 0x89 followed by "PNG"; the full signature is 8 bytes).
I haven't yet seen a library that does this.
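A header check of this kind might look like the sketch below. sniffFileType() is a made-up name, and the signature table is a small sample I am assuming is enough for illustration; it only tells you what the file claims to be, not whether the rest of it is intact.

```php
<?php
// Hypothetical helper: map the leading bytes of a file to a type label.
function sniffFileType(string $bytes): ?string
{
    // Well-known magic numbers; extend as needed.
    $signatures = [
        'pdf'  => "%PDF",
        'png'  => "\x89PNG\r\n\x1a\n",
        'jpeg' => "\xFF\xD8\xFF",
        'gif'  => "GIF8",
    ];

    foreach ($signatures as $type => $magic) {
        if (strncmp($bytes, $magic, strlen($magic)) === 0) {
            return $type;
        }
    }

    return null; // unknown signature
}

// Usage: only the first few bytes have to be read from disk.
// $head = file_get_contents($path, false, null, 0, 16);
// $type = sniffFileType($head);
```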

Related

PHP Imagick reinterpretation of PNG IDAT chunks

I noticed that PHP Imagick changes the IDAT chunks when processing PNGs.
How exactly is this done? Is there a possibility to create IDAT chunks that remain unchanged? Is it possible to predict the outcome of Imagick?
Background information to this questions:
I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs:
$image = new Imagick('uploaded_file.png');
$image->stripImage();
$image->writeImage('secure_file.png');
Comments are stripped out, so the only way to bypass this filter is hiding the PHP payload in the IDAT chunk(s). As described here, it is theoretically possible but Imagick somehow reinterprets this Image data even if I set Compression and CompressionQuality to the values I used to create the PNG. I also managed to create a PNG whose ZLIB header remained unchanged by Imagick, but the raw compressed image data didn't. The only PNGs where I got identical input and output are the ones which went through Imagick before. I also tried to find the reason for this in the source code, but couldn't locate it.
I'm aware of the fact that other checks are necessary to ensure the uploaded file is actually a PNG etc. and PHP code in PNGs is no problem if the server is configured properly, but for now I'm just interested in this issue.

IDAT chunks can vary and still produce an identical image. The PNG spec unfortunately forces the IDAT chunks to form a single continuous data stream. What this means is that the data can be grouped/chunked differently, but when re-assembled into a single stream it will be identical. Is the actual data different, or has just the "chunking" changed? If the latter, why does it matter if the image is identical? PNG uses lossless compression, so stripping the metadata and even decompressing and recompressing an image shouldn't change any pixel values.
If you're comparing the compressed data and expecting it to be identical, it can be different and still yield an identical image. This is because FLATE compression uses an iterative process to find the best matches in previous data. The higher the "quality" number you give it, the more it will search for matches and shrink the output data size. With zlib, a level 9 deflate request will take a lot longer than the default and result in slightly smaller output data size.
So, please answer the following questions:
1) Are you trying to compare the compressed data before/after your strip operation to see if somehow the image changed? If so, then looking at the compressed data is not the way to do it.
2) If you want to strip metadata without any other aspect of the image file changing then you'll need to write the tool yourself. It's actually trivial to walk through PNG chunks and reassemble a new file while skipping the chunks you want to remove.
Answer my questions and I'll update my answer with more details...
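For point 2, walking the chunks by hand could look something like this. It is a sketch, not a hardened tool: stripPngChunks() and its default keep-list are my own assumptions, and CRCs are not verified because kept chunks are copied byte for byte (so IDAT is never recompressed).

```php
<?php
// Hypothetical helper: copy a PNG, keeping only the listed chunk types.
function stripPngChunks(
    string $png,
    array $keep = ['IHDR', 'PLTE', 'tRNS', 'IDAT', 'IEND']
): string {
    $signature = "\x89PNG\r\n\x1a\n";
    if (strncmp($png, $signature, 8) !== 0) {
        throw new InvalidArgumentException('Not a PNG file');
    }

    $out = $signature;
    $offset = 8;
    $total = strlen($png);

    while ($offset + 12 <= $total) {
        // Chunk layout: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        $length = unpack('N', substr($png, $offset, 4))[1];
        $type = substr($png, $offset + 4, 4);
        $size = 12 + $length;

        if ($offset + $size > $total) {
            throw new InvalidArgumentException('Truncated PNG chunk');
        }
        if (in_array($type, $keep, true)) {
            $out .= substr($png, $offset, $size); // copied unchanged
        }

        $offset += $size;
        if ($type === 'IEND') {
            break; // ignore anything appended after the end marker
        }
    }

    return $out;
}
```

Note that this removes comments and trailing data but, by design, leaves whatever is inside IDAT untouched, so it does not address payloads hidden in the image data itself.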

I wondered whether the following code (part of a PHP file upload) can prevent hiding PHP code (e.g. webshells) in PNGs
You should never need to think about this. If you are worried about people hiding webshells in a file that is uploaded to your server, you are doing something wrong.
For example, serving those files through the PHP parser, which is the way a webshell could be invoked to attack a server.
From the Imagick readme file:
5) NEVER directly serve any files that have been uploaded by users directly through PHP, instead either serve them through the webserver, without invoking PHP, or use readfile to serve them within PHP.
readfile doesn't execute the file, it just sends it to the end-user without invoking it, and so completely prevents the type of attack you seem to be concerned about.
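A minimal sketch of serving an upload with readfile follows. serveUpload() is a hypothetical name, the caller is assumed to pass a MIME type it determined itself, and the nosniff header is my own addition rather than something from the Imagick readme.

```php
<?php
// Hypothetical helper: send a stored upload to the client as inert bytes.
function serveUpload(string $path, string $mimeType): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    if (!headers_sent()) {
        header('Content-Type: ' . $mimeType);
        header('Content-Length: ' . filesize($path));
        // Assumption: ask browsers not to guess a different content type.
        header('X-Content-Type-Options: nosniff');
    }

    // readfile() copies the file to the output buffer; any "<?php" inside
    // the file is sent as plain bytes and never interpreted.
    return readfile($path) !== false;
}
```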

On Google App Engine, how do I stop users uploading malicious files renamed to something friendly?

I am working on an app where users can upload images, and I've been setting it up on Google App Engine, which I think is fantastic so far. But I'm having trouble figuring out the best way to validate a user's upload as a proper image. I can't see anything in their documentation about it and the search results on Google are few and far between, so I'm wondering if I am headed down the wrong path with this.
Basically I don't want to store files that aren't images (mainly for cost and consistency reasons, as well as to protect my users), so I need a valid way to determine that. I was thinking of using the Cloud Storage tools API to get the content type, but I'm wondering if that is just going to be based on the file extension, because the file type that the GCS upload URL gave me back was 'image/jpeg' when I renamed a .exe to a .jpg.
I really feel like I'm missing something here. Has anyone else come across this issue on Google App Engine yet?

The GD extension is available, so you could call getimagesize on the file.
If it gives you a valid size then assume it's an image, otherwise it is not.
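As a rough sketch of that check (looksLikeImage() and the default whitelist are assumptions of mine, not part of the answer):

```php
<?php
// Hypothetical helper: true when getimagesize() recognises the file as one
// of the allowed image types with non-zero dimensions.
function looksLikeImage(
    string $path,
    array $allowedTypes = [IMAGETYPE_JPEG, IMAGETYPE_PNG, IMAGETYPE_GIF]
): bool {
    // Returns false for files it cannot identify; @ hides read notices.
    $info = @getimagesize($path);
    if ($info === false) {
        return false;
    }

    [$width, $height, $type] = $info;

    return $width > 0 && $height > 0 && in_array($type, $allowedTypes, true);
}
```

Keep in mind that getimagesize() mostly reads the header, so a file that passes can still be truncated or carry extra data after the image.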

To determine the file type, you will need a mechanism to inspect the first few bytes of the file that will typically contain the magic number. The magic number is a way to determine the file type. Take a look at various magic numbers for popular file extensions : http://billatnapier.wordpress.com/2013/04/22/magic-numbers-in-files/

Similar question with solution:
For python: How to check type of files without extensions in python?
For PHP (see two last post): Best way to recognize a filetype in php

Use exif_imagetype. According to the docs, it's faster than getimagesize.

What is the final say on handling image uploads with php?

After reading a lot of articles, I still have to ask: what should I actually do to secure my site from hack attempts via file upload?
From these links:
This link says that MIME IS USELESS and that EXTENSION IS THE WAY TO GO. But in the end the two parties are just arguing and, if I'm correct, BOTH agree that both MIME and EXTENSION have security holes. A lot of hate over there.
This link agrees that MIME is useless AND that EXTENSION is not FOOLPROOF either, as HTML or JavaScript code can be inserted in a GIF image file (or others) and can be misinterpreted by IE, leading to a quick backdoor entrance for malicious code. (I really wish everyone would just vote to stop the use of IE. It's like it was made to be a hacking browser.)
This link says to give the file NON-EXECUTABLE PERMISSIONS so that no matter what it is, it won't run. (But would this protect us from XSS/HTML/JavaScript/etc. embedded in the images, like the one mentioned in the second point? If giving the file non-executable permissions would protect us from those embedded threats, would it also protect us from other threats? Are there other forms of hack that can bypass this approach?)
And then there's this link that says "Re-process the image"; other methods are just "fun boring for hackers". Which is, in a way, a solid way of checking that the IMAGE is an IMAGE (IMO, because Imagick won't convert a non-image, right? Not sure. I haven't dived deep into it yet).
So what is the best and secure way to protect our sites from file upload threats?
If we check for all:
VALID MIME TYPE
VALID EXTENSION
GETIMAGESIZE() CHECK
ENSURE NON-EXECUTABLE PERMISSIONS
REPROCESS THE IMAGE
Would that be enough? For a SAFE SECURE Image File Upload?

mime-type is easy to fake, file extension is easier to fake. Use them if you need a clue on what the file type is, assuming the user is a good guy. Don't rely on it.
My point exactly
Giving the file non-executable permissions is a good idea in general, but it is useless from a web security point of view. Are your .php files executable? No. Are they still processed by the web server? Yes.
This is the way to go. Open the file with imagick for example. If imagick complains about the file format, then don't keep it.

Risks of a php image upload form? [duplicate]

So I have a client who wants a photography site where users can upload their photos in response to photography competitions. Though technically this isn't a problem, I want to know the risks associated with allowing any user to upload any image onto my server. I've got the feeling the risks are high...
I was thinking of using something like this http://www.w3schools.com/php/php_file_upload.asp
If I do let anonymous users upload files, how can I secure the directory the images (and potentially damaging files) will be uploaded into?

if you want to be sure that the image is a real image you can load using gd http://www.php.net/gd
if the gd resource is created correctly then the image is a real image
first detect the mime using:
getimagesize($filename);
then, for example if it is a jpeg load into gd:
$gdresource = imagecreatefromjpeg($filename);
if $gdresource is valid/created without warnings, the image is valid and not corrupted... getimagesize() is (probably) not good enough to detect corrupted images
also, another important note... don't rely on $_FILES['blabla']['name'] because it could contain non valid utf-8 sequences (assuming that you are using utf-8 for example) and it could be a potential attack mechanism, as any user input
so you'll need to validate / sanitize that as well
$originalFileName = $_FILES['blabla']['name'];
$safeOriginalFileName = iconv('UTF-8', 'UTF-8//IGNORE', $originalFileName);
// more additional checks here. for example filename is empty ""
move_uploaded_file(...., $safeOriginalFileName);
also, remember that $_FILES['blabla']['name'] contains the file extension, which may not be correct. so you'll need to strip it out and use the actual correct extension (that you previously resolved using getimagesize() + imagecreatefrom*())
$safeOriginalFileName = pathinfo($safeOriginalFileName, PATHINFO_FILENAME); // drops any directory part and the extension (basename() alone would keep the extension)
$safeOriginalFileName = $safeOriginalFileName . ".jpg"; // correct extension
hope this helps :)
also as DaveRandom pointed out, don't rely also on $_FILES['blabla']['type'], use instead as I suggested getimagesize() + imagecreatefrom*()
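Putting the filename advice together, a sketch could look like the following. buildStoredName() is a hypothetical helper, the extension map and the character whitelist are assumptions, and $imageType is expected to be index 2 of the getimagesize() result rather than anything taken from $_FILES.

```php
<?php
// Hypothetical helper: derive a safe stored name from the client-supplied
// name and the image type detected on the server.
function buildStoredName(string $clientName, int $imageType): ?string
{
    // Extension comes from the detected type, never from the client.
    $extensions = [
        IMAGETYPE_JPEG => 'jpg',
        IMAGETYPE_PNG  => 'png',
        IMAGETYPE_GIF  => 'gif',
    ];
    if (!isset($extensions[$imageType])) {
        return null; // type not on the whitelist
    }

    // Drop any directory part and the client's extension.
    $base = pathinfo($clientName, PATHINFO_FILENAME);
    // Replace everything outside a conservative ASCII whitelist.
    $base = trim(preg_replace('/[^A-Za-z0-9_-]+/', '_', $base), '_');
    if ($base === '') {
        $base = 'upload'; // fallback for empty or fully stripped names
    }

    return $base . '.' . $extensions[$imageType];
}
```

In practice many sites go one step further and store uploads under a generated name, keeping the original name only as metadata.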

The uploaded file is stored at a temporary location, this location can be found in the $_FILES variable.
When your script accepts the uploaded file, you can use move_uploaded_file() to move it to the location of your choice.
So even if the user is anonymous, you are in control of what to do with uploads and whether to accept them (e.g. based on content, size, etc.) or not.
Furthermore, the (anonymous) user provides the file and the accompanying details. So, if you blindly use these details, you are vulnerable (a user with bad intentions will probably provide wrong details to make the file look legitimate). So, if you need these details, gather them yourself (instead of using $_FILES)!
For more information see the PHP documentation

You will have to research a bit, but these are the main hints:
The most basic check you can make is the image's MIME type and extension, although these are certainly easy to forge.
Use binary-safe functions like readfile(), fopen() and file_get_contents(). I don't remember exactly which ones, but there were a few PHP functions that had security issues when handling files; research which ones they are and avoid them.
There are some functions out there using preg_match() and similar that will check if there's something resembling a script in the file you are reading. Use them to make sure there aren't hidden scripts. This will slow down the process a bit, as preg_match() can be resource-expensive when reading big files, but it shouldn't be very noticeable.
You could also trigger an antivirus to run on the uploaded files, as the email services do.
As far as I know, potentially damaging images would normally contain scripting languages, like PHP code or JavaScript to try XSS attacks. There are a lot of dangers out there, so I guess you can't guarantee 100% the safety of the files, but keep having a look periodically to see the new dangers and ways to avoid them.

How does Facebook do it? Checking the file type? [duplicate]

Create a text file, rename it to anything.jpg and try uploading it to Facebook: Facebook detects that the file is not an image and says "Please select an image file" or something like that. How do they do it?
I tested it on my localhost by creating a dummy HTML form with an <input type="file"... element and uploading an image file created by renaming a text file to something.jpg, and the file type in $_FILES['control_name']['type'] showed image/jpeg... How do I block users from uploading such 'fake' images? I think restricting by $_FILES['control_name']['type'] is not a solution, right?

When you process the image on the server, use an image function (getimagesize, for example) to detect its width and height. When this fails, reject the image. You will probably do it anyway to generate a thumbnail, so it is just one extra if.

There are many ways of checking the actual files. How Facebook does it, only the ones who created it know, I think :).
Most likely they will look at the first bytes in the file. All files have certain bytes describing what they truly are. For this, however, you need loads of time/money to create a database or similar against which you can validate the uploads.
More common solutions are:
FORM attribute
In a lot of browsers, of course excluding Internet Explorer, you can set an accept attribute which checks extensions client-side. More info here: File input 'accept' attribute - is it useful?
Extension
This is not really secure, since a script can be saved with an image extension.
Read file MIME TYPE
This is a solution like the one you stated in your question. This, however, is also easy to bypass and relies on the up-to-date status of your server.
Processing the image
The most reliable (for most developer skills and available time) would be to process the image as a test.
Put it through a library like GD or Imagick. They will raise errors when an image is not really an image. This, however, will require you to keep that software up to date.
In short, there is no 100% guarantee of catching this without spending tons of hours. Even then you only get 99.9%. You should weigh your available time against the above options and choose the one that suits you best. As best practice I recommend a combination of all 3.
This topic is also discussed in Security: How to validate image file uploads?
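For the "read the MIME type" option, the type can at least be derived from the file's content on the server instead of trusting $_FILES[...]['type'], which is supplied by the browser. A small sketch, assuming the fileinfo extension is available (detectMimeType() is a made-up name):

```php
<?php
// Hypothetical helper: MIME type guessed from the file content by the
// fileinfo extension, or null when it is unavailable or detection fails.
function detectMimeType(string $path): ?string
{
    if (!class_exists('finfo') || !is_file($path)) {
        return null;
    }

    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);

    return $mime === false ? null : $mime;
}

// Usage: compare against a whitelist, ignoring the client-supplied type.
// $ok = in_array(detectMimeType($_FILES['control_name']['tmp_name']),
//                ['image/jpeg', 'image/png', 'image/gif'], true);
```

This still only inspects signatures, so it is a first filter in front of the "process the image" step rather than a replacement for it.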

Headers in your file won't be the same.
