Check if file is JPEG, PDF or TIFF - php

How would I check that a file is either JPEG, PDF or TIFF? And I mean actually checking the contents, not just trusting the MIME type and file extension.
I have access to the raw file data (this check is part of an uploader) and I need to verify that the files are either JPEG, PDF or TIFF. I assume I would have to check for some sort of header in the files, but I have no idea what to look for or where to start.

exif_imagetype() is very useful for this: http://us2.php.net/manual/en/function.exif-imagetype.php
It scans the initial bytes of the file to determine the graphic type. It supports a large number of graphic formats (and returns false if it doesn't recognize the format).

You need to implement byte sequence tests.
Here is a guide to checking byte sequences for the most common image formats.

This can be tricky. Most file formats start with a "magic number", a fixed byte sequence that basically acts as a "header" identifying the format, but there is no single standard that every file type has to follow.
I found this wiki-page about different signatures: http://en.wikipedia.org/wiki/List_of_file_signatures
So in the best case scenario you just need to validate these first bytes.

If you have access to the raw file, you can check the file header for its magic number. This number defines the type of file.

To check for image types you can use the exif_imagetype() function.
For PDF: you have to open the file, read the first bytes and check whether it starts with '%PDF'.
$fp = fopen($pdf, 'rb'); // binary mode
// fread() returns up to 4 bytes; fgets($fp, 4) would stop after only 3
if (fread($fp, 4) === '%PDF')
{
    // ... is pdf
}
fclose($fp);

There is no sure-fire way to be certain, but the first few bytes of a file are its signature/fingerprint for the file handlers to test. See https://en.wikipedia.org/wiki/List_of_file_signatures
File types can vary considerably and some allow for variable or shifting headers (at one time PDF did not mandate that its 40-bit signature come first), so with a degree of uncertainty we can assume the following hex values, sometimes erroneously called "Magic Numbers", as representing the start of each byte stream.
So in general, to answer for the requested types:
FF D8 (ÿØ) would be a Jpeg (EXCEPT JP2000=FF 4F or 00 00) in raw binary or /9j/4 in Base64 format
25 50 44 46 2d (%PDF-) would be the 40 bit signature of a PDF or JVBER in Base64 format
89 50 4E 47 (‰PNG) would be PNG in raw binary or iVBOR in Base64 format
Just for good measure, here is the related, older GIF sequence:
47 49 46 38 (GIF8), which is R0lGO as Base64. We can also see that the first 8 bits are 01000111 for G.
Thus in ALL the above cases just the first byte would already be a very good indicator, with no need for the full magic string. ZIP-based formats such as docx, pptx, xlsx and cbz, however, ALL share the same magic number:
50 4B (PK) base64 = UEsDB
Finally, the last requested type was TIFF, which comes in two byte orders, Intel (little-endian) and Motorola (big-endian), so you need to test for both:
49 49 2A 00 (II* ) base64 = SUkqA
4D 4D 00 2A (MM *) base64 = TU0AK
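The signatures listed above can be tested in a few lines of PHP. This is only a rough sketch: the function name detect_upload_type() is made up for illustration, it looks at nothing beyond the first five bytes, and it assumes the PDF signature sits at offset 0, which (as noted) has not always been guaranteed. The JPEG entry also includes the third byte FF, which starts the first marker after FF D8.

```php
<?php
// Hypothetical helper (not a PHP built-in): classify a file by its leading bytes.
// Returns 'jpeg', 'pdf', 'tiff', or null when no signature matches.
function detect_upload_type(string $path): ?string
{
    $fp = fopen($path, 'rb');           // binary-safe read
    if ($fp === false) {
        return null;
    }
    $head = (string) fread($fp, 5);     // the longest signature below is 5 bytes
    fclose($fp);

    $signatures = [
        'jpeg' => ["\xFF\xD8\xFF"],
        'pdf'  => ['%PDF-'],
        'tiff' => ["II*\x00", "MM\x00*"], // Intel and Motorola byte order
    ];
    foreach ($signatures as $type => $magics) {
        foreach ($magics as $magic) {
            // strncmp() is binary safe and fails cleanly on short files
            if (strncmp($head, $magic, strlen($magic)) === 0) {
                return $type;
            }
        }
    }
    return null;
}
```

In an uploader you would call it on the temporary file, e.g. detect_upload_type($_FILES['doc']['tmp_name']) (the field name 'doc' is just an example), and reject the upload when it returns null.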

Related

How do I detect the mime-type of an image that is already a resource variable?

Is it possible to determine the mime-type of an image resource, while it is still a resource variable? If I output the resource variable and use mime_content_type() or getimagesize(), its mime type is already set via whatever output function I use (imagejpeg(), imagepng() etc).
The reason I need to know this is to determine if the image may have transparency: if it was a JPEG, I know it can't; if it was a PNG or GIF, I know it potentially could.
Any advice would be appreciated!
I just googled for magic numbers for jpg, png and I found this site:
https://asecuritysite.com/forensics/magic
which lists the following magic numbers for these file types:
.jpg => FFD8
.gif => 47 49 46 38
.png => 89 50 4E 47
These numbers are the values of the first n bytes of the file which work as a signature of the file type. The values are expressed in hexadecimal.
By peeking into these values you can determine the type of the file.

Trouble with zlib compressed File

I have inherited a zlib compressed file and long story short, I need to UN-zlib-compress this puppy back to its original content.
I have been racking my brain trying to figure out what in the world is happening, but I am hitting a wall and I am hoping you good people will help me out to figure out what's going on.
I have tried a lot of things so far and won't bore you with every single one, but this is what I landed on last. All I get is garbled output and I don't know what is wrong, especially since the last decode step complains about the data, saying:
Warning: gzuncompress(): data error in
C:\xampp\htdocs\test-box\index.php on line 6
Warning: zlib_decode():
data error in C:\xampp\htdocs\test-box\index.php on line 8
and this is the code - nothing fancy, I am trying to get it to work before going too crazy with it yet and so the simplicity should allow us to better analyze it.
<?php
$filename = 'c5ytvbg4y.x'; // this is the zlib compressed file
$file = filesize($filename); // using this for the length
$zd = gzopen($filename, "r"); // create valid pointer
$contents = gzread($zd, $file); // binary safe read the content
$decoded = gzuncompress($contents); // using gzdecode produces the same issue
gzclose($zd); // close the pointer
zlib_decode($decoded); // decode it but I get nothing but garble
?>
Any assistance would be appreciated. Ideally I want to be able to open it, uncompress it back to normal and save it to a new file. But at the moment I would be happy just to find out why I get nothing but garbled text back. Also keep in mind that I know the $file length above is not ideal; I will put in a while (!feof($zd)) loop or something to that effect later. I wanted to keep it simple for now while trying to get the larger issue figured out.
Any thoughts, recommendations, suggestions, code assistance, or whatnot would be greatly appreciated, TIA.
Additions
#Mark's Request:
0A 12 0F 04 04 D8 44 DA BF 63 C4 93 93 3B 49 51 17 A2 6F E3 0C 12 4D E4 24 F6 C8 BA D0 60 76 81
It is definitely not a "zlib compressed file", at least not the first 32 bytes, nor is it any format that uses the deflate compression method (e.g. gzip, zip, png, etc.), because there is no valid deflate compressed data in the provided bytes.
The zlib header typically starts with hexadecimal 78. Your data starts with 0A, which isn't valid as part of a zlib header. (Technically it is sort of valid, but it implies a compression format that isn't supported by any version of zlib.)
The gzip header starts with hexadecimal 1F 8B. That isn't present in your data either.
So, I'm not sure what this data is, but it's neither gzip nor zlib data. You'll need to do some more research to figure out what it is.
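To make the header test above concrete, here is a small sketch that looks at the first two bytes only. The function name sniff_deflate_container() is invented for this example, and the zlib rule is the RFC 1950 header check (method nibble 8, window nibble at most 7, the two bytes divisible by 31 as a big-endian number); a match only means the header looks plausible, not that the stream is intact.

```php
<?php
// Hypothetical helper: guess the container format from the first two bytes.
function sniff_deflate_container(string $bytes): string
{
    if (strlen($bytes) < 2) {
        return 'unknown';
    }
    $b0 = ord($bytes[0]);
    $b1 = ord($bytes[1]);

    if ($b0 === 0x1F && $b1 === 0x8B) {
        return 'gzip';                  // gzip magic number 1F 8B
    }
    // zlib (RFC 1950): low nibble = method (8 is deflate), high nibble =
    // window size (max 7), and both header bytes form a multiple of 31.
    if (($b0 & 0x0F) === 8 && ($b0 >> 4) <= 7 && (($b0 << 8) | $b1) % 31 === 0) {
        return 'zlib';
    }
    return 'unknown';
}

// First bytes of the dump posted in the question: method nibble is 0xA, not 8.
echo sniff_deflate_container(hex2bin('0A120F04')), "\n"; // prints "unknown"
```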

Get compressed byte size after zlib_decode()?

I'm trying to use PHP to parse a custom gzip archive file format that was created in Delphi (not my code!). The format is basically:
4-byte integer: count of files in archive
for each compressed file:
4-byte integer: filename length [n]
[n] bytes: filename
4-byte integer: uncompressed file length [m]
[????] bytes: gzipped content
I can read the file and actually decode the first compressed file correctly by using zlib_decode() with a max uncompressed length of [m] bytes on the remainder of the file after I know the length ([m]), but then I'm stuck because I don't know how far into the substring I should go to find the next filename -- zlib_decode() doesn't return the number of compressed bytes that it processed before stopping. Since this is a custom format, it doesn't seem like I can use the normal gzopen()/gzread() functions because the entire file isn't compressed (I tried, it doesn't work).
This code works in Delphi because apparently you can pass a file handle back and forth between normal file reading functions and the System.ZLib decoding functions -- you can read [m] uncompressed bytes and the pointer will remain at the last compressed byte -- but PHP doesn't seem to support switching between read-as-normal and read-as-gzip on the fly that way.
Am I missing an obvious way in PHP to deal with a mixed-content file format like this, where metadata and compressed data are stacked together this way? Or am I out of luck without knowing the compressed data length?
A dirty workaround is to recompress the content of each file as I am able to parse it, use that to calculate the compressed length, and adjust the file pointer in the original file manually as follows:
$current_pos = ftell($handle);
$skip_length = strlen(gzencode($uncompressed_text,9,FORCE_DEFLATE));
fseek($handle, $skip_length+$current_pos);
This works, but feels very hack-ish. I'd still be open to any better approaches.
EDIT:
Just a note that this eventually failed. However, I was fortunate enough to know in advance the list of expected filenames and I was able to do the following (more reliable since zlib_decode() will decode as much as it can and discard the rest anyway):
foreach ($filenames as $thisFilename) {
$thisPos = strpos($rawData, $thisFilename);
$gzresult = zlib_decode(substr($rawData, $thisPos + strlen($table) + 8)); // skip 8 bytes for filename size and uncompressed data size, which are useless info.
}
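One possible alternative to guessing the compressed length is PHP's incremental inflate API: inflate_init(), inflate_add(), and (as of PHP 7.2) inflate_get_status() and inflate_get_read_len(), the last of which reports how many input bytes the decoder consumed. The sketch below is untested against the real Delphi archive and rests on two assumptions: each entry is a zlib-wrapped stream (ZLIB_ENCODING_DEFLATE), and the 4-byte integers are little-endian. The names inflate_at() and read_archive() are made up.

```php
<?php
// Hypothetical helper: inflate the stream that starts at $offset and return
// [uncompressed data, number of compressed bytes consumed].
function inflate_at(string $data, int $offset): array
{
    $ctx = inflate_init(ZLIB_ENCODING_DEFLATE); // assumption: zlib-wrapped data
    $out = inflate_add($ctx, substr($data, $offset));
    if ($out === false || inflate_get_status($ctx) !== ZLIB_STREAM_END) {
        throw new RuntimeException("Corrupt or truncated stream at offset $offset");
    }
    return [$out, inflate_get_read_len($ctx)];
}

// Walk the layout described in the question. 'V' = unsigned 32-bit
// little-endian, which is an assumption about the Delphi writer.
function read_archive(string $data): array
{
    $pos = 0;
    $count = unpack('V', $data, $pos)[1];
    $pos += 4;
    $files = [];
    for ($i = 0; $i < $count; $i++) {
        $nameLen = unpack('V', $data, $pos)[1];
        $pos += 4;
        $name = substr($data, $pos, $nameLen);
        $pos += $nameLen + 4;           // also skip the uncompressed length field
        [$content, $used] = inflate_at($data, $pos);
        $pos += $used;                  // now at the next entry's name length
        $files[$name] = $content;
    }
    return $files;
}
```

The decoder stops at the end of each stream even though more data follows, so nothing has to be recompressed to find where the next entry begins.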

Parsing GIF application's extension blocks- how to find block size?

I am parsing a GIF 89a (yes, I need to) file and I am stuck on Application Extension blocks.
They have a 14-byte header (including the leading 21 FF 0B bytes) and then there is some data. How much data is there? How do I know how much to read?
You can skip the section below if you know the answer and just tell me :)
This page says:
"ApplicationData contains the information that is used by the software application. This field is structured in a series of sub-blocks identical to the data found in a Plain Text Extension block."
Each sub-block begins with a byte that indicates the number of data bytes that follow. From 1 to 255 data bytes may follow this byte. There may be any number of sub-blocks in this field.
This way I can parse NETSCAPE 2.0 blocks which are:
03 01 00 00 00
so I have a loop in PHP:
for (;;)
{
    $size = ord(fread($handle, 1));
    if ($size == 0) break;
    fseek($handle, $size, SEEK_CUR); // skip relative to the current position
}
or the same in Delphi, if you prefer:
while F.Position < F.Size do begin
F.Read(Size, 1); // F is TFileStream
if Size = 0 then break;
F.Position := F.Position + Size;
end;
The iteration goes:
size = read 1 byte; //size = 3;
read 3 byte;
size = read 1 byte;
size = 0 so break
So far, so good, here comes the problem: the XMP Data
So the bytes in this block go like this (ASCII below):
21 FF 0B 58 4D 50 20 44 61 74 61 58 4D 50
!`.XMP DataXMP
and then goes ASCII XML dump:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
for about 500 bytes.
I obviously can't read it the same way I read NETSCAPE 2.0 blocks.
It seems to be terminated with 00 byte.
Should I just always read until a 00 byte? Then it would fail on NETSCAPE 2.0 blocks!
How should a GIF decoder behave on Application Extension blocks? How much data is in them?
Problematic XMP Data image
OK, the NETSCAPE 2.0 block approach might be fine after all; it was failing on the XML because my file was probably being read incorrectly.
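For reference, the sub-block walk described above can be sketched like this. The function name skip_sub_blocks() is made up, and the important detail is that the seek is relative (SEEK_CUR). As far as I know, the XMP specification appends a 258-byte "magic trailer" after the XML packet precisely so that a decoder following size bytes this way still ends up on the 00 terminator, which would explain why the same loop works for both block types.

```php
<?php
// Hypothetical helper: skip a chain of GIF data sub-blocks. $handle must be
// positioned on the first size byte. Returns the number of data bytes skipped
// and leaves the pointer just past the 00 block terminator.
function skip_sub_blocks($handle): int
{
    $total = 0;
    while (true) {
        $byte = fread($handle, 1);
        if ($byte === false || $byte === '') {
            throw new RuntimeException('Unexpected end of file inside sub-blocks');
        }
        $size = ord($byte);
        if ($size === 0) {
            return $total;                  // block terminator reached
        }
        fseek($handle, $size, SEEK_CUR);    // relative seek over the data bytes
        $total += $size;
    }
}
```

Call it right after reading the 11 bytes of application identifier and authentication code that follow 21 FF 0B.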

How do I tell if someone's faking a filetype? (PHP)

I'm programming something that allows users to store documents and pictures on a webserver, to be stored and retrieved later. When users upload files to my server, PHP tells me what filetype it is based on the extension. However, I'm afraid that users could rename a zip file as somezipfile.png and store it, thus keeping a zip file on my server. Is there any reasonable way to open an uploaded file and "check" to see if it truly is of the said filetype?
Magic number. If you can read the first few bytes of a binary file, you can tell what kind of file it is.
Check out the Fileinfo extension for PHP (originally a PECL package, bundled with PHP since 5.3), which can do the MIME magic lookups for you.
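A minimal sketch of such a lookup, assuming the fileinfo extension is enabled (it usually is by default). The wrapper name sniff_mime() is invented; the finfo class, FILEINFO_MIME_TYPE and the buffer() method are the extension's own.

```php
<?php
// Ask libmagic (through the fileinfo extension) for the MIME type of raw
// bytes. The file name and extension play no part in the answer.
function sniff_mime(string $bytes): string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->buffer($bytes);
    // Fall back to a generic type if the lookup itself fails
    return $mime === false ? 'application/octet-stream' : $mime;
}

echo sniff_mime("%PDF-1.4\n"), "\n";
```

For an upload you would more likely call $finfo->file() on the temporary path instead of loading the whole file into a string.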
Sort of. Most file types have some bytes reserved for marking them so that you don't have to rely on the extension. The site http://wotsit.org is a great resource for finding this out for a particular type.
If you are on a unix system, I believe that the file command doesn't rely on the extension, so you could shell out to it if you don't want to write the byte checking code.
For PNG (http://www.w3.org/TR/PNG-Rationale.html)
The first eight bytes of a PNG file always contain the following values:
(decimal) 137 80 78 71 13 10 26 10
(hexadecimal) 89 50 4e 47 0d 0a 1a 0a
(ASCII C notation) \211 P N G \r \n \032 \n
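For the renamed-zip scenario in the question, comparing against that 8-byte signature could look roughly like this (has_png_signature() is a made-up name, and passing the test only proves the file starts like a PNG, not that the rest of it is a valid image):

```php
<?php
// Hypothetical check: does the file begin with the 8-byte PNG signature
// 89 50 4E 47 0D 0A 1A 0A?
function has_png_signature(string $path): bool
{
    // Read only the first 8 bytes (offset 0, length 8)
    $head = file_get_contents($path, false, null, 0, 8);
    return $head === "\x89PNG\r\n\x1a\n";
}
```

A zip file renamed to somezipfile.png starts with 50 4B (PK) instead, so it fails this comparison.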
If you are only dealing with images, then getimagesize() should distinguish a valid image from a fake one.
$ php -r 'var_dump(getimagesize("b&n.jpg"));'
array(7) {
[0]=>
int(200)
[1]=>
int(200)
[2]=>
int(2)
[3]=>
string(24) "width="200" height="200""
["bits"]=>
int(8)
["channels"]=>
int(3)
["mime"]=>
string(10) "image/jpeg"
}
$ php -r 'var_dump(getimagesize("/etc/passwd"));'
bool(false)
If getimagesize() returns false, the file is not an image it recognizes.
Many filetypes have "magic numbers" at the beginning of the file to identify them. You can read some bytes from the front of the file and compare them to a list of known magic numbers.
For an exact answer on how you could quickly do this in PHP, check out this question: How do I find the mime-type of a file with php?
As a side note, I ran into a similar problem where I had to do my own type checking. The front end of my application was done in Flash, and the files were being passed through Flash to a PHP script. When I attempted a MIME type check in PHP, the type returned was always application/octet-stream because the upload was coming from Flash.
I had to implement a magic-number approach. I simply created an XML file that held each file type along with some defining patterns found at the beginning of such files. Once a file reached the server, I did some pattern matching against the XML file and then accepted or rejected the upload. I didn't notice any real performance decrease either, which I had been expecting.
This is just a side note for anyone who may be using Flash as their front end and trying to type check files once they are uploaded.
As well as identifying the filetype, you might want to watch out for files with other files embedded or appended to them. This will unfortunately require a more indepth analysis of the file contents than just using "magic numbers".
For example, http://quantumrook.wordpress.com/2007/06/06/hide-a-rar-file-in-a-jpg-file/ (this particular type of data hiding can be easily worked around by loading and resaving into a new file the actual image data .. others will be more difficult.)
On a unix system, capturing the output from the 'file' command should provide adequate info.
