Determine file type of a string - php

PHP ships with various methods of identifying the type of a file, but is it possible to identify a data type when the file in question only exists as a binary string representation and not as an actual file on disc?
The reason for this is I'm doing some maintenance work on a CMS where the previous developer, being a bit of a wally, decided to store image data into the system as database BLOBs. My current project is dumping the BLOBs out into files, and saving the path to the files into the database in place of the BLOBs.
As I said, my predecessor was a bit of a wally and not only did he store all this data as BLOBs, he also didn't save the datatype of the data anywhere.
The migration utility I wrote for part of this project saves the file to disc without an extension, tries to determine the type of the file with exif_imagetype() and if it manages to identify the file type, renames the file with the correct extension.
However, the classes that use the image data also need updating so they can continue to function with paths and files on disc instead of BLOBs.
The methods that create and update images expect binary strings (to BLOB into the database) and in an ideal world I'd rather rewrite these methods to use is_uploaded_file, move_uploaded_file, etc. However, there's no evidence anywhere in the class of direct manipulation of the $_FILES array, so the file data obviously comes from outside the class, but given how convoluted the code is (and with no comments to help out) I can't find it.
As a stopgap solution until I finally track down the actual file upload management code, I plan to manipulate the file data as strings in the class as is currently done, but saving the strings to files instead of into the database. This should minimize the impact on other parts of the codebase that are relying on this class.
I could just do what the migration script is doing and rename the file after saving and then identifying it, but this could prove problematic in the case where there is already a file there. I'd rather know what the data type is before I commit the data to disc.

finfo_buffer() is what you want. You can pass it your string and it will tell you what the file type is based on your mime.magic file.
More info here: http://us3.php.net/manual/en/function.finfo-buffer.php
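A minimal sketch of how that could look for the migration case, assuming the fileinfo extension is enabled. The helper name guessImageExtension() and the MIME-to-extension map are made up for illustration; they are not part of PHP or of the CMS in the question.

```php
<?php
// Hypothetical helper: pick a file extension for a binary string
// before anything is written to disk.
function guessImageExtension(string $data): ?string
{
    // FILEINFO_MIME_TYPE makes finfo return only the MIME type, e.g. "image/png"
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->buffer($data);
    if ($mime === false) {
        return null; // detection failed
    }

    // Assumed whitelist of the types the CMS is expected to hold
    $extensions = [
        'image/jpeg' => 'jpg',
        'image/png'  => 'png',
        'image/gif'  => 'gif',
    ];

    return $extensions[$mime] ?? null;
}
```

With the extension known up front, the final file name can be checked for collisions before the data is committed to disk.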

You can use finfo_buffer. It works on strings rather than on-disk files.

If you are working only with images, you can find the file type by looking at the first few bytes of the string: PNG files begin with 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A, GIF files begin with the ASCII text GIF (GIF87a or GIF89a), and JPEG files begin with 0xFF 0xD8. I wrote a script for parsing the headers of those three types, but since I don't have it here, I'll update my answer as soon as I get it.
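In the meantime, here is a rough sketch of such a check. It is not the script mentioned above; the function name sniffImageType() is made up, and it only compares the leading signature bytes without validating the rest of the image.

```php
<?php
// Hypothetical helper: identify PNG, GIF, or JPEG data from its leading bytes.
// Returns a file extension, or null when no known signature matches.
function sniffImageType(string $data): ?string
{
    // PNG: 89 50 4E 47 0D 0A 1A 0A
    if (strncmp($data, "\x89PNG\r\n\x1a\n", 8) === 0) {
        return 'png';
    }
    // GIF: ASCII "GIF87a" or "GIF89a"
    if (strncmp($data, 'GIF87a', 6) === 0 || strncmp($data, 'GIF89a', 6) === 0) {
        return 'gif';
    }
    // JPEG: FF D8, followed by FF that starts the first marker segment
    if (strncmp($data, "\xFF\xD8\xFF", 3) === 0) {
        return 'jpg';
    }
    return null;
}
```

strncmp() is binary-safe, so the string can be passed straight from the BLOB column.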

Related

PHP - parsing contents of excel/pdf file already retrieved and stored in variable, without having to save contents to a file on disk [closed]

Here is the scenario:
I have a variable in php that has raw contents of an excel file and I want to parse the contents of that variable (which is in an excel format, or can also be in a pdf format) for a certain value. I am looking for a keyword near the end of the contents of the file and will need to extract some of the contents near the desired value inside the contents of the file so I can get it into a variable in php and output to my webpage. From what I know the file is in binary, or hex representation but the ascii conversion is represented as readable text with diamond characters (with a question mark) and rectangles with a border and other extraneous characters including readable text content.
Here are the requirements:
I don't want to parse the contents of the file by first storing or saving it on disk. I want to parse the contents of the retrieved file directly while in a php variable.
Here is my question:
How do I go about this? Should I rely upon PHPExcel to read this content if possible? If not, what php libraries can accomplish this task?
Should I rely upon PHPExcel to read this content if possible?
It is not possible (see below).
If not, what PHP libraries can accomplish this task?
None that I know of.
How do I go about this?
An Excel file (rather, an Excel 2007+ XLSX file - Excel 97 XLS files are a wholly different can of worms) is a ZIP archive containing XML and other files in a tree structure. So your first stage is to decompress a ZIP file in a string; PHPExcel relies on the ZipArchive class, and this, in turn, does not support string reading and also bypasses most stream hacks. A similar problem - actually exactly the same problem - is described in this question.
You could think of using stream wrapping to decode the file from a string, and the first part - the reading - would work. The writing of the files would not. And you cannot modify the ZipArchive class so that it writes to a memory object, because it is a native class.
So you can employ a slight variation, from one of the answers above (the one by toster-cx). You need to decode the ZIP structure yourself, and thus get the offset in the ZIP file where the file you need begins. This will either be /xl/worksheets/sheet1.xml or /xl/sharedStrings.xml, depending on whether the string has been inlined by Excel, or not. This also assumes that the format is the newer XLSX. Once you have that, you can extract the data from the string and decompress it, then search it for the token.
Of course, a more efficient use of the time would be to determine exactly why you don't want to use temporary files. Maybe that problem can be solved another way.
Speed problem
Actually, reading/writing an Excel file is not so terrible, because in this case you don't need to do that. You can almost certainly consider it a Zip file, and open it using ZipArchive and getStream() to directly access the internal sub-file you're interested in. This operation will be quite fast, also because you can run the search from the getStream() read cycle. You do need to write the file once, but nothing more.
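A sketch of that read cycle, assuming the zip extension is loaded and the upload has already been written to a temporary path. The function name zipEntryContains() and the 8192-byte chunk size are arbitrary choices for illustration. Note that entry names inside the archive have no leading slash (xl/sharedStrings.xml).

```php
<?php
// Hypothetical helper: search one member of a ZIP archive (such as an XLSX)
// for a token while streaming it, without extracting the archive.
function zipEntryContains(string $zipPath, string $entryName, string $token): bool
{
    if ($token === '') {
        return false;
    }
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        return false; // not a readable ZIP file
    }
    $stream = $zip->getStream($entryName);
    if ($stream === false) {
        $zip->close();
        return false; // entry not present
    }

    $found = false;
    $tail = '';
    // Keep the end of each chunk so a token split across two reads is still seen
    $overlap = strlen($token) - 1;
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false) {
            break;
        }
        $window = $tail . $chunk;
        if (strpos($window, $token) !== false) {
            $found = true;
            break;
        }
        $tail = $overlap > 0 ? substr($window, -$overlap) : '';
    }

    fclose($stream);
    $zip->close();
    return $found;
}
```

A call such as zipEntryContains($tmpPath, 'xl/sharedStrings.xml', 'Invoice total') stops reading as soon as the token shows up, so only part of the sub-file may need to be decompressed.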
In fact, chances are that you can write the file while it is being uploaded (what do you use for Web upload? The plupload JS library has a very nice hook to capture very large files one chunk at a time). You still need a temporary area on the disk where to store the data, but in this case the time expenditure will be exclusively dedicated to the decompression and reading of the XML sub-file - the same thing you'd have needed to do with a string object.
It is also (perhaps, depending on several factors, mainly the platform and operating system) possible to offload this part of the work to a secondary process running in the background, so that the user sees the page reload immediately, while the information appears after a while. This part, however, is pretty tricky and can rapidly turn into a maintenance nightmare (yeah, I do have first-hand experience on this. In my case it was tiled image conversion).
Cheating
OK, fact is I love cheating; it's so efficient. You say that you control the XLSX and PDF being created? Well! It turns out that in both cases, you can add hidden metadata to the file. And those metadata are much more easily read than you might think.
For example, you can add zip archive comments to a XLSX file, since it is a Zip file. Actually you could add a fake file with zero length to the archive, call it INVOICE_TOTAL_12345.xml, and that would mean that the invoice total is 12345. The advantage is that the file names are stored in the clear inside the XLSX file, so you can just use preg_match and look for INVOICE_TOTAL_([0-9]+)\.xml and retrieve your total.
Same goes for PDF. You can store keywords in a PDF. Just add a keyword attribute named "InvoiceTotal" (check the PDF to see how that turns out). But there is also a PDF ID inside the PDF, and that ID will be at the very end of the PDF. It will be something like /ID [<ec144ea3ecbb9ab8c22b413fec06fe29><ec144ea3ecbb9ab8c22b413fec06fe29>], but just use a known sequence such as deadbeef and ec144ea3ecbb9ab8c22deadbeef12345 will, again, mean the total is 12345. The ID before the known sequence will be random, so the overall ID will still be random and valid.
In both cases you could now just look for a known token in the string, exactly as requested.
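Both lookups could be sketched roughly like this. The function names are invented, and the code assumes the file was produced with the INVOICE_TOTAL_ entry name or the deadbeef marker described above.

```php
<?php
// Hypothetical helper: read the total from a fake entry name such as
// INVOICE_TOTAL_12345.xml, which sits uncompressed in the raw XLSX bytes.
function invoiceTotalFromXlsx(string $raw): ?int
{
    if (preg_match('/INVOICE_TOTAL_([0-9]+)\.xml/', $raw, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: read the total from a PDF /ID entry whose hex string
// ends with a known marker followed by the digits of the total.
function invoiceTotalFromPdfId(string $raw, string $marker = 'deadbeef'): ?int
{
    $pattern = '/<[0-9a-f]*' . preg_quote($marker, '/') . '([0-9]+)>/i';
    if (preg_match($pattern, $raw, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}
```

Neither function parses the container format; each simply scans the raw string, so they work on a PHP variable without touching the disk.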

Store images in directories and store a reference to the images in the database

I am currently involved in a project to create a website which allows users to share and rate images, for their creative media course.
I am trying to find ways to save images to a MySQL database. I know I can save images as BLOBs, but this won't work as I plan on only allowing users to save high-res images. Therefore, I've tried to find out how to store images in a directory/server folder and store references to the images in the database. An added complication to the matter is that the reference must be saved automatically in a MySQL database table.
Does anyone know how to go about this, or can you point me in the right direction?
Thanks!
I've actually built a similar website (mass image uploader) so I can speak from experience.
Keeping track of the files
Save the image file as-is on disk and save the path to the file in the database. This part should be pretty straightforward.
One disadvantage is that you need a database lookup for every image, but if your table is well optimized (indexes) this should be no real problem.
There are many advantages, such as your files become easily referable and you can add meta data to your files (like number of views).
Filenames
Now, saving files, lots of files, is not immediately straightforward.
If you don't care at all about filenames just generate a random hash like:
$filename = md5(uniqid()); // generate a random hash, mileage may vary
This gets rid of all kinds of filename-related issues like duplicate filenames, unsupported characters etc.
If you want to preserve the filename, store the filename in the database.
If you want your filename on disk to also be somewhat human readable I would go for a mixed approach: partly hash, partly original filename. You will need to filter unsupported characters (like /), and perhaps transliterate similar characters (like é -> e and ß -> ss). Foreign languages such as Chinese and Hebrew can give interesting results, so be aware of that. You could also encode any foreign character (like base64_encode) but that doesn't do much for readability.
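Finally, be aware of file path length constraints. Filenames and file paths cannot be infinitely long. On Windows the full path is traditionally limited to 260 characters (MAX_PATH), and most filesystems limit a single filename to 255 characters or bytes.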
Buckets
You should definitely consider using buckets because OSes (and humans) don't like folders with thousands of files.
If you're using hashes you already have a convenient bucket scheme available.
If your hash is 0aa1ea9a5a04b78d4581dd6d17742627
Your bucket(s) can be: 0/a/a/1/e/a9a5a04b78d4581dd6d17742627. In this case there are 5 nested buckets, which means you can expect to have one file in each bucket after 16^5 (~1 million) files. How many levels of buckets you need is up to you.
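As a sketch, the bucket path can be derived from the hash like this. The function name bucketPath() and the default of 5 levels are only illustrative.

```php
<?php
// Hypothetical helper: turn a hash into a nested bucket path, using one
// directory level per leading character of the hash.
function bucketPath(string $hash, int $levels = 5): string
{
    // Always leave at least one character for the file name itself
    $levels = max(0, min($levels, strlen($hash) - 1));
    if ($levels === 0) {
        return $hash;
    }
    $dirs = str_split(substr($hash, 0, $levels));
    return implode('/', $dirs) . '/' . substr($hash, $levels);
}
```

bucketPath('0aa1ea9a5a04b78d4581dd6d17742627') gives 0/a/a/1/e/a9a5a04b78d4581dd6d17742627, matching the layout above. The directories still have to be created (for example with mkdir() and its recursive flag) before the file is written.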
Mime-type
It's also good to keep track of the original file extension / mime-type. If you only have one kind of mime-type (like TIFF) then you don't need to worry about it. Most file formats have some signature that makes them easy to detect, but you don't want to have to rely on that. PNG files, for example, have PNG in their first few bytes (open one with a text editor to see it).
Relative path vs absolute path
I would also recommend saving the relative path to the files, not the absolute path. This makes maintenance much easier.
So save:
0/a/a/1/e/a9a5a04b78d4581dd6d17742627
instead of:
/var/www/wwwdata/images/0/a/a/1/e/a9a5a04b78d4581dd6d17742627

PHP File upload security - keeping the original file name

I want to allow registered users of a website (PHP) to upload files (documents), which are going to be publicly available for download.
In this context, is the fact that I keep the file's original name a vulnerability ?
If it is one, I would like to know why, and how to get rid of it.
While this is an old question, it's surprisingly high on the list of search results when looking for 'security file names', so I'd like to expand on the existing answers:
Yes, it's almost surely a vulnerability.
There are several possible problems you might encounter if you try to store a file using its original filename:
the filename could be a reserved or special file name. What happens if a user uploads a file called .htaccess that tells the webserver to parse all .gif files as PHP, then uploads a .gif file with a GIF comment of <?php /* ... */ ?>?
the filename could contain ../. What happens if a user uploads a file with the 'name' ../../../../../etc/cron.d/foo? (This particular example should be caught by system permissions, but do you know all locations that your system reads configuration files from?)
if the user the web server runs as (let's call it www-data) is misconfigured and has a shell, how about ../../../../../home/www-data/.ssh/authorized_keys? (Again, this particular example should be guarded against by SSH itself (and possibly the folder not existing), since the authorized_keys file needs very particular file permissions; but if your system is set up to give restrictive file permissions by default (tricky!), then that won't be the problem.)
the filename could contain the \x00 byte, or control characters. System programs may not respond to these as expected - e.g. a simple ls -al | cat (not that I know why you'd want to execute that, but a more complex script might contain a sequence that ultimately boils down to this) might execute commands.
the filename could end in .php and be executed once someone tries to download the file. (Don't try blacklisting extensions.)
The way to handle this is to roll the filenames yourself (e.g. md5() on the file contents or the original filename). If you absolutely must allow the original filename, then to the best of your ability whitelist the file extension, MIME-type check the file, and whitelist what characters can be used in the filename.
Alternatively, you can roll the filename yourself when you store the file and for use in the URL that people use to download the file (although if this is a file-serving script, you should avoid letting people specify filenames here, anyway, so no one downloads your ../../../../../etc/passwd or other files of interest), but keep the original filename stored in the database for display somewhere. In this case, you only have SQL injection and XSS to worry about, which is ground that the other answers have already covered.
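A rough sketch of both ideas follows. The extension whitelist, the allowed character set, and the function names are assumptions chosen for illustration, not a complete policy. It does not cover the MIME-type check or the SQL injection and XSS handling mentioned above.

```php
<?php
// Assumed whitelist; adjust to the document types the site really accepts
const ALLOWED_EXTENSIONS = ['pdf', 'doc', 'docx', 'txt'];

// Hypothetical helper: build the on-disk name from the file contents, keeping
// only a whitelisted extension. Returns null when the extension is not allowed.
function storedFileName(string $originalName, string $contents): ?string
{
    $ext = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    if (!in_array($ext, ALLOWED_EXTENSIONS, true)) {
        return null;
    }
    return md5($contents) . '.' . $ext;
}

// Hypothetical helper: reduce the original name to a safe display form by
// dropping directory parts and replacing every non-whitelisted character.
function displayName(string $originalName): string
{
    $base = basename(str_replace('\\', '/', $originalName));
    return (string) preg_replace('/[^A-Za-z0-9._-]/', '_', $base);
}
```

Names such as .htaccess, shell.php, or ../../../../../etc/cron.d/foo are rejected by storedFileName() because none of them ends in a whitelisted extension.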
That depends where you store the filename. If you store the name in a database, in strictly typed variable, then HTML encode before you display it on a web page, there won't be any issues.
The name of the files could reveal potentially sensitive information. Some companies/people use different naming conventions for documents, so you might end up with :
Author name ( court-order-john.smith.doc )
Company name ( sensitive-information-enterprisename.doc )
File creation date ( letter.2012-03-29.pdf )
I think you get the point, you can probably think of some other information people use in their filenames.
Depending on what your site is about this could become an issue (consider if wikileaks published leaked documents that had the original source somewhere inside the filename).
If you decide to hide the filename, you must consider the problem of somebody submitting an executable as a document, and how you make sure people know what they are downloading.

How to upload Files with chinese names in PHP?

I have a learning portal (LMS) where I upload documents, images, videos, etc. to create content. If the file being uploaded has a Chinese name, it does not get uploaded correctly. Instead, a corrupted file with a junk name is uploaded.
For example, I tried to upload a file named 地球科学.jpg. But on the server I got this file as 地çƒç§‘å­¦.jpg. Also the uploaded file is corrupted in the server.
I want this file to be uploaded with the same name on the server.
Because I want to search for these files and reuse later for creating content.
FYI:
I have XAMPP server installed on Windows XP.
Chinese, Korean, and Japanese language packs installed.
Thanks for your answers.
AFAIK NTFS can't handle some characters on the filesystem. I would suggest storing the file under a generic name.
For example, you could create a table with two columns, name and file: in name you save the original name, and in file you store something like md5(name).
If you need the name to search for it use a database to store name information and the file location and save the file using your own convention.
Example
// sql entry
original name = 地球科学.jpg
path = /some/place/1.jpg
When you search, you use the database to look up a given file name and its location. Separating storage logic from naming is common when building image storage solutions, not only because of naming problems but also because of limitations and speed considerations as files accumulate in folders.
Use iconv or mb_convert_encoding to change character string encoding.
// Build the destination path from the uploaded file's original name
$target_path = "uploadfiles/";
$target_path .= $_FILES['fileField']['name'];
// Option 1: iconv()
move_uploaded_file($_FILES['fileField']['tmp_name'], iconv("UTF-8", "big5", $target_path));
// Option 2: mb_convert_encoding()
move_uploaded_file($_FILES['fileField']['tmp_name'], mb_convert_encoding($target_path, "big5", "UTF-8"));
Make sure the page displaying the form is rendered in UTF-8; usually this does the job. You can also use the accept-charset attribute of the form element to indicate the charset in which the posted data is sent.
Not sure if this all will do the job, let me know.
I think you might want to use some kind of database solution, especially when you need to search files later on. With a database you can avoid I/O overhead.
I think you must learn/understand what character set the file is in before you can work out how to handle the upload. I'm afraid I'm not too familiar with non-european character sets and don't know which are most widely used.
UTF-8 should be a safe bet to handle almost whatever you care to throw at it. There's some relevant information that could be useful in terms of configuring your application in a post I wrote recently on my blog: How to Avoid Character Encoding Problems in PHP

File uploads with php - displaying a list of files

I am in the middle of making a script to upload files via php. What I would like to know is how to display the files already uploaded, and when clicking on them open them for download. Should I store the names and path in a database, or just list the contents of a directory with php?
Check out handling file uploads in PHP. A few points:
Ideally you want to allow the user to upload multiple files at the same time. Just create extra file inputs dynamically with Javascript for this;
When you get an upload, make sure you check that it is an upload with is_uploaded_file;
Use move_uploaded_file() to copy the file to wherever you're going to store it;
Don't rely on what the client tells you the MIME type is;
Sending them back to the client can be done trivially with a PHP script but you need to know the right MIME type;
Try and verify that what you get is what you expect (eg if it is a PDF file use a library to verify that it is), particularly if you use the file for anything or send it to anyone else; and
I would recommend you store the file name of the file from the client's computer and display that to them regardless of what you store it as. The user is just more likely to recognise this than anything else.
Storing paths in the database might be okay, depending on your specific application, but consider storing the filenames in the database and construct your paths to those files in PHP in a single place. That way, if you end up moving all uploaded files later, there is only one place in your code you need to change path generation, and you can avoid doing a large amount of data transformation on your "path" field in the database.
For example, for the file 1234.txt, you might store it in:
/your_web_directory/uploaded_files/1/2/3/1234.txt
You can use a configuration file or, if you prefer, a global somewhere to define the path where your uploads are stored (/your_web_directory/uploaded_files/) and then split characters from the filename (in the database) to figure out which subdirectory the file actually resides in.
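A sketch of such a single path-generating function is below. The constant UPLOAD_ROOT, the function name uploadPath(), and the depth of 3 are placeholders, and the code assumes the stored filenames are application-generated (for example numeric IDs), not raw user input.

```php
<?php
// Assumed configuration value: the one place where the storage root is defined
const UPLOAD_ROOT = '/your_web_directory/uploaded_files';

// Hypothetical helper: derive the full path of an upload from the filename
// stored in the database, using its first characters as subdirectories.
function uploadPath(string $filename, int $depth = 3): string
{
    $name = basename($filename);                 // ignore any directory part
    $stem = pathinfo($name, PATHINFO_FILENAME);  // name without extension
    $dirs = [];
    for ($i = 0; $i < $depth && $i < strlen($stem); $i++) {
        $dirs[] = $stem[$i];
    }
    $sub = $dirs ? implode('/', $dirs) . '/' : '';
    return UPLOAD_ROOT . '/' . $sub . $name;
}
```

If the files are moved later, only UPLOAD_ROOT (or the splitting rule) changes, and the filenames stored in the database stay as they are.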
As for displaying your files, you can simply load your list of files from the database and use a path-generating function to get download paths for each one based on their filenames. If you want to paginate the list of files, try using something like LIMIT 0, 50 in MySQL (offset first, then row count). Just pass in a new offset with each successive page of upload results.
maybe you should use files, in this sense:
myfile.txt
My Uploaded File||my_upload_dir/my_uploaded_file.pdf
Other Uploaded File||my_upload_dir/other_uploaded.html
and go through them like this:
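<?php
$file = "myfile.txt";
// Read the index file, one "Title||path" entry per line
$lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$files = array();
foreach ($lines as $i => $line) {
    // Split each line into the display name and the stored path
    $parts = explode("||", $line, 2);
    $files[$i][0] = $parts[0];
    $files[$i][1] = $parts[1] ?? '';
}
print_r($files);
?>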
hope this helps. :)
What I always did (past tense, I haven't written an upload script for ages) is, I'd link up an upload script (any upload script) to a simple database.
This offers some advantages;
You do not offer your users direct insight into your file system (what if there is a leak in your 'browse'-script and you expose your whole hard drive?)
You can store extra information and meta-data in an easy and efficient way
You can actually query for files / meta-data instead of just looping through all the files
You can enable a 'safe-delete', where you delete the row, but keep the file (for example)
You can enable logging way more easily
Showing files in pages is easier
You can 'mask' files. Using a database enables you to store a 'masked' filename, and a 'real' filename.
Obviously, there are some disadvantages as well;
It is a little harder to migrate, since your file system and database have to be in sync
If an operation fails (on either end) you have either a 'corrupt' database or file system
As mentioned before (but we can not mention enough, I'm afraid); _Keep your uploading safe!_
The MIME type / extension issue is one that is going on for ages.. I think most of the web is solid nowadays, but there used to be a time when developers would check either MIME type or extension, but never both (why bother?). This resulted in websites being very, very leaky.
If not written properly, upload scripts are a big hole in your security. A great example of that is a website I 'hacked' a while back (at their request, of course). They supported the upload of images to a photo album, but they only checked the file extension. So I uploaded a GIF with a directory scanner inside. This allowed me to scan through their whole system (since it wasn't a dedicated server, I could see a little more than that).
Hope I helped ;)
