Serving large files with PHP

Serving large files with PHP - php

So I am trying to serve large files via a PHP script, they are not in a web accessible directory, so this is the best way I can figure to provide access to them.
The only way I could think of off the bat to serve this file is by loading it into memory (fopen, fread, ect.), setting the header data to the proper MIME type, and then just echoing the entire contents of the file.
The problem with this is, I have to load these ~700MB files into memory all at once, and keep the entire thing there till the download is finished. It would be nice if I could stream in the parts that I need as they are downloading.
Any ideas?

You don't need to read the whole thing - just enter a loop reading it in, say, 32Kb chunks and sending it as output. Better yet, use fpassthru which does much the same thing for you....
$name = 'mybigfile.zip';
$fp = fopen($name, 'rb');
// send the right headers
header("Content-Type: application/zip");
header("Content-Length: " . filesize($name));
// dump the file and stop the script
fpassthru($fp);
exit;
even less lines if you use readfile, which doesn't need the fopen call...
$name = 'mybigfile.zip';
// send the right headers
header("Content-Type: application/zip");
header("Content-Length: " . filesize($name));
// dump the file and stop the script
readfile($name);
exit;
If you want to get even cuter, you can support the Content-Range header which lets clients request a particular byte range of your file. This is particularly useful for serving PDF files to Adobe Acrobat, which just requests the chunks of the file it needs to render the current page. It's a bit involved, but see this for an example.

The best way to send big files with php is the X-Sendfile header. It allows the webserver to serve files much faster through zero-copy mechanisms like sendfile(2). It is supported by lighttpd and apache with a plugin.
Example:
$file = "/absolute/path/to/file"; // can be protected by .htaccess
header('X-Sendfile: '.$file);
header('Content-type: application/octet-stream');
header('Content-Disposition: attachment; filename="'.basename($file).'"');
// other headers ...
exit;
The server reads the X-Sendfile header and sends out the file.

While fpassthru() has been my first choice in the past, the PHP manual actually recommends* using readfile() instead, if you are just dumping the file as-is to the client.
* "If you just want to dump the contents of a file to the output buffer, without first modifying it or seeking to a particular offset, you may want to use the readfile(), which saves you the fopen() call." —PHP manual

If your files are not accessible by the web server because the path is not in your web serving directory (htdocs) then you can make a symbolic link (symlink) to that folder in your web serving directory to avoid passing all traffic trough php.
You can do something like this
ln -s /home/files/big_files_folder /home/www/htdocs
Using php for serving static files is a lot slower, if you have high traffic, memory consumption will be very large and it may not handle a big number of requests.

Have a look at fpassthru(). In more recent versions of PHP this should serve the files without keeping them in memory, as this comment states.

Strange, neither fpassthru() nor readfile() did it for me, always had a memory error.
I resorted to use passthru() without the 'f':
$name = 'mybigfile.zip';
// send the right headers
header("Content-Type: application/zip");
header("Content-Length: " . filesize($name));
// dump the file and stop the script
passthru('/bin/cat '.$filename);
exit;
this execs 'cat' Unix command and send its output to the browser.
comment for slim: the reason you just don't put a symlink to somewhere is webspace is SECURITY.

One of benefits of fpassthru() is that this function can work not only with files but any valid handle. Socket for example.
And readfile() must be a little faster, cause of using OS caching mechanism, if possible (as like as file_get_contents()).
One more tip. fpassthru() hold handle open, until client gets content (which may require quite a long time on slow connect), and so you must use some locking mechanism if parallel writes to this file possible.

The Python answers are all good. But is there any reason you can't make a web accessible directory containing symbolic links to the actual files? It may take some extra server configuration, but it ought to work.

If you want to do it right, PHP alone can't do it. You would want to serve the file by using Nginx's X-Accel-Redirect (Recommended) or Apache's X-Sendfile, which are built exactly for this purpose.
I will include in this answer some text found on this article.
Why not serve the files with PHP:
Done naively, the file is read into memory and then served. If the
files are large, this could cause your server to run out of memory.
Caching headers are often not set correctly. This causes web browsers
to re-download the file multiple times even if it hasn't changed.
Support for HEAD requests and range requests is typically not
automatically supported. If the files are large, serving such files
ties up a worker process or thread. This can lead to starvation if
there are limited workers available. Increasing the number of workers
can cause your server to run out of memory.
NGINX handles all of these things properly. So let's handle permission checks in the application and let NGINX serve the actual file. This is where internal redirects come in. The idea is simple: you can configure a location entry as usual when serving regular files.
Add this to your nginx server block:
location /protected_files/ {
internal;
alias /var/www/my_folder_with_protected_files/;
}
In your project, require the HTTP Foundation package:
composer require symfony/http-foundation
Serve the files in PHP using Nginx:
use Symfony\Component\HttpFoundation\BinaryFileResponse;
$real_path = '/var/www/my_folder_with_protected_files/foo.pdf';
$x_accel_redirect_path = '/protected_files/foo.pdf';
BinaryFileResponse::trustXSendfileTypeHeader();
$response = new BinaryFileResponse( $real_path );
$response->headers->set( 'X-Accel-Redirect', $accel_file );
$response->sendHeaders();
exit;
This should be the basic you need to get started.
Here's a more complete example serving an Inline PDF:
use Symfony\Component\HttpFoundation\BinaryFileResponse;
use Symfony\Component\HttpFoundation\File\File;
use Symfony\Component\HttpFoundation\ResponseHeaderBag;
$real_path = '/var/www/my_folder_with_protected_files/foo.pdf';
$x_accel_redirect_path = '/protected_files/foo.pdf';
$file = new File( $file_path );
BinaryFileResponse::trustXSendfileTypeHeader();
$response = new BinaryFileResponse( $file_path );
$response->setImmutable( true );
$response->setPublic();
$response->setAutoEtag();
$response->setAutoLastModified();
$response->headers->set( 'Content-Type', 'application/pdf' );
$response->headers->set( 'Content-Length', $file->getSize() );
$response->headers->set( 'X-Sendfile-Type', 'X-Accel-Redirect' );
$response->headers->set( 'X-Accel-Redirect', $accel_file );
$response->headers->set( 'X-Accel-Expires', 60 * 60 * 24 * 90 ); // 90 days
$response->headers->set( 'X-Accel-Limit-Rate', 10485760 ); // 10mb/s
$response->headers->set( 'X-Accel-Buffering', 'yes' );
$response->setContentDisposition( ResponseHeaderBag::DISPOSITION_INLINE, basename( $file_path ) ); // view in browser. Change to DISPOSITION_ATTACHMENT to download
$response->sendHeaders();
exit;

Related

download file with PHP through varnish generating them directly on the output buffer

the scenario is:
Download a file that is generate directly onto php://output write into it.
Without varnish the behavior is the file is downloaded properly while the server writing on the buffer.
with varnish the client is waiting until the whole file is generated and then download the file.
Is there a particular configuration for varnish to accomplish the start immediatly download the file instead of waiting the full generated file?
I already try to pass the URL (Varnish rule to skip the caching mechanism) where the file is generated, but writing into the buffer it doesn't make sense, doesn it?
EDIT
from a perspective PHP view, It open a filestream on php://output and write into that stream
$out = fopen( 'php://output', 'w' );
fputcsv( $out, $whatever ); // or fwrite

I found the solution in varnish configuration:
It is enough to do not cache that specific url (and http verbs)
something like this:
if ((req.url ~ "/url") && (req.url == "POST")) {
return(pipe);
}

Downloading large(ish) zip served with PHP gets corrupted for people with a slow connection

I'm a novice, so I'll try and do my best to explain a problem I'm having. I apologize in advance if there's something I left out or is unclear.
I'm serving an 81MB zip file outside my root directory to people who are validated beforehand. I've been getting reports of corrupted downloads or an inability to complete the download. I've verified this happening on my machine if I simulate a slow connection.
I'm on shared hosting running Apache-Coyote/1.1.
I get a network timeout error. I think my host might be doing killing the downloads if they take too long, but they haven't verified either way.
I thought I was maybe running into a memory limit or time limit, so my host installed the apache module XSendFile. My headers in the file that handles the download after validation are being set this way:
<?php
set_time_limit(0);
$file = '/absolute/path/to/myzip/myzip.zip';
header("X-Sendfile: $file");
header("Content-type: application/zip");
header('Content-Disposition: attachment; filename="' . basename($file) . '"');
Any help or suggestions would be appreciated. Thanks!

I would suggest taking a look at this comment:
http://www.php.net/manual/en/function.readfile.php#99406
Particularly, if you are using apache. If not the code in the link above should be helpful:
I started running into trouble when I had really large files being sent to clients with really slow download speeds. In those cases, the
script would time out and the download would terminate with an
incomplete file. I am dead-set against disabling script timeouts - any
time that is the solution to a programming problem, you are doing
something wrong - so I attempted to scale the timeout based on the
size of the file. That ultimately failed though because it was
impossible to predict the speed at which the end user would be
downloading the file at, so it was really just a best guess so
inevitably we still get reports of script timeouts.
Then I stumbled across a fantastic Apache module called mod_xsendfile ( https://tn123.org/mod_xsendfile/ (binaries) or
https://github.com/nmaier/mod_xsendfile (source)). This module
basically monitors the output buffer for the presence of special
headers, and when it finds them it triggers apache to send the file on
its own, almost as if the user requested the file directly. PHP
processing is halted at that point, so no timeout errors regardless of
the size of the file or the download speed of the client. And the end
client gets the full benefits of Apache sending the file, such as an
accurate file size report and download status bar.
The code I finally ended up with is too long to post here, but in general is uses the mod_xsendfile module if it is present, and if not
the script falls back to using the code I originally posted. You can
find some example code at https://gist.github.com/854168
EDIT
Just to have a reference of code that does the "chunking" Link to Original Code:
<?php
function readfile_chunked ($filename,$type='array') {
$chunk_array=array();
$chunksize = 1*(1024*1024); // how many bytes per chunk
$buffer = '';
$handle = fopen($filename, 'rb');
if ($handle === false) {
return false;
}
while (!feof($handle)) {
switch($type)
{
case'array':
// Returns Lines Array like file()
$lines[] = fgets($handle, $chunksize);
break;
case'string':
// Returns Lines String like file_get_contents()
$lines = fread($handle, $chunksize);
break;
}
}
fclose($handle);
return $lines;
}
?>

How to return a file in PHP

I have a file
/file.zip
A user comes to
/download.php
I want the user's browser to start downloading the file. How do i do that? Does readfile open the file on server, which seems like an unnecessary thing to do. Is there a way to return the file without opening it on the server?

I think you want this:
$attachment_location = $_SERVER["DOCUMENT_ROOT"] . "/file.zip";
if (file_exists($attachment_location)) {
header($_SERVER["SERVER_PROTOCOL"] . " 200 OK");
header("Cache-Control: public"); // needed for internet explorer
header("Content-Type: application/zip");
header("Content-Transfer-Encoding: Binary");
header("Content-Length:".filesize($attachment_location));
header("Content-Disposition: attachment; filename=file.zip");
readfile($attachment_location);
die();
} else {
die("Error: File not found.");
}

readfile will do the job OK and pass the stream straight back to the webserver. It's not the best solution as for the time the file is sent, PHP still runs. For better results you'll need something like X-SendFile, which is supported on most webservers (if you install the correct modules).
In general (if you care about heavy load), it's best to put a proxying webserver in front of your main application server. This will free up your application server (for instance apache) up quicker, and proxy servers (Varnish, Squid) tend to be much better at transfering bytes to clients with high latency or clients that are generally slow.

If the file is publicly accessable, just do a simple redirect to the URL of your file.

If the file is public, then you can just serve it as a static file directly from the web server (e.g. Apache), and make download.php redirect to the static URL. Otherwise, you have to use readfile to send the file to the browser after authenticating the user (remember about the Content-Dispositon header).

LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing

Often a web service needs to zip up several large files for download by the client. The most obvious way to do this is to create a temporary zip file, then either echo it to the user or save it to disk and redirect (deleting it some time in the future).
However, doing things that way has drawbacks:
a initial phase of intensive CPU and disk thrashing, resulting in...
a considerable initial delay to the user while the archive is prepared
very high memory footprint per request
use of substantial temporary disk space
if the user cancels the download half way through, all resources used in the initial phase (CPU, memory, disk) will have been wasted
Solutions like ZipStream-PHP improve on this by shovelling the data into Apache file by file. However, the result is still high memory usage (files are loaded entirely into memory), and large, thrashy spikes in disk and CPU usage.
In contrast, consider the following bash snippet:
ls -1 | zip -# - | cat > file.zip
# Note -# is not supported on MacOS
Here, zip operates in streaming mode, resulting in a low memory footprint. A pipe has an integral buffer – when the buffer is full, the OS suspends the writing program (program on the left of the pipe). This here ensures that zip works only as fast as its output can be written by cat.
The optimal way, then, would be to do the same: replace cat with a web server process, streaming the zip file to the user with it created on the fly. This would create little overhead compared to just streaming the files, and would have an unproblematic, non-spiky resource profile.
How can you achieve this on a LAMP stack?

You can use popen() (docs) or proc_open() (docs) to execute a unix command (eg. zip or gzip), and get back stdout as a php stream. flush() (docs) will do its very best to push the contents of php's output buffer to the browser.
Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).
(Note: don't use flush(). See the update below for details.)
Something like the following can do the trick:
<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');
// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');
// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while( !feof($fp) ) {
$buff = fread($fp, $bufsize);
echo $buff;
}
pclose($fp);
You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.
If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server; and use child_process module to spawn the tar/zip/whatever pipeline.
Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per-processor-core.
Update (from Benji's excellent feedback in the comments section on this answer)
1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.
[editorial note] 8192 is almost certainly a platform dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the os to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems.
There are other circumstances that can cause fread to return even less than 8192 bytes -- for example, the "remote" client (or process) is slow to fill the buffer - in most cases, fread() will return the contents of the input buffer as-is without waiting for it to get full. This could mean anywhere from 0..os_buffer_size bytes get returned.
The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).
2. According to comments on fread docs, a few caveats: magic quotes may interfere and must be turned off.
3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.
4. If you're creating a zip (as opposed to gzip), you'd want to use the content type header:
Content-type: application/zip
or... 'application/octet-stream' can be used instead. (it's a generic content type used for binary downloads of all different kinds):
Content-type: application/octet-stream
and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the content-disposition header. (where filename indicates the name that should be suggested in the save dialog):
Content-disposition: attachment; filename="file.zip"
One should also send the Content-length header, but this is hard with this technique as you don’t know the zip’s exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?
Finally, here's a revised example that uses all of #Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):
<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');
// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');
// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while( !feof($fp) ) {
$buff = fread($fp, $bufsize);
echo $buff;
}
pclose($fp);
Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur when running in other configurations too. The problem appears to result when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes in an overrun of Apache's internal output buffer. This causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.
The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.

Another solution is my mod_zip module for Nginx, written specifically for this purpose:
https://github.com/evanmiller/mod_zip
It is extremely lightweight and does not invoke a separate "zip" process or communicate via pipes. You simply point to a script that lists the locations of files to be included, and mod_zip does the rest.

Trying to implement a dynamic generated download with lots of files with different sizes i came across this solution but i run into various memory errors like "Allowed memory size of 134217728 bytes exhausted at ...".
After adding ob_flush(); right before the flush(); the memory errors disappear.
Together with sending the headers, my final solution looks like this (Just storing the files inside the zip without directory structure):
<?php
// Sending headers
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="download.zip"');
header('Content-Transfer-Encoding: binary');
ob_clean();
flush();
// On the fly zip creation
$fp = popen('zip -0 -j -q -r - file1 file2 file3', 'r');
while (!feof($fp)) {
echo fread($fp, 8192);
ob_flush();
flush();
}
pclose($fp);

I wrote this s3 steaming file zipper microservice last weekend - might be useful: http://engineroom.teamwork.com/how-to-securely-provide-a-zip-download-of-a-s3-file-bundle/

According to the PHP manual, the ZIP extension provides a zip: wrapper.
I have never used it and I don't know its internals, but logically it should be able to do what you're looking for, assuming that ZIP archives can be streamed, which I'm not entirely sure of.
As for your question about the "LAMP stack" it shouldn't be a problem as long as PHP is not configured to buffer output.
Edit: I'm trying to put a proof-of-concept together, but it seems not-trivial. If you're not experienced with PHP's streams, it might prove too complicated, if it's even possible.
Edit(2): rereading your question after taking a look at ZipStream, I found what's going to be your main problem here when you say (emphasis added)
the operative Zipping should operate in streaming mode, ie processing files and providing data at the rate of the download.
That part will be extremely hard to implement because I don't think PHP provides a way to determine how full Apache's buffer is. So, the answer to your question is no, you probably won't be able to do that in PHP.

It seems, you can eliminate any output-buffer related problems by using fpassthru(). I also use -0 to save CPU time since my data is compact already. I use this code to serve a whole folder, zipped on-the-fly:
chdir($folder);
$fp = popen('zip -0 -r - .', 'r');
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="'.basename($folder).'.zip"');
fpassthru($fp);

I just released a ZipStreamWriter class written in pure PHP userland here:
https://github.com/cubiclesoft/php-zipstreamwriter
Instead of using external applications (e.g. zip) or extensions like ZipArchive, it supports streaming data into and out of the class by implementing a full-blown ZIP writer.
How the streaming aspect works is by using the ZIP file format's "Data Descriptors" as described by section 4.3.5 of the PKWARE ZIP file specification:
4.3.5 File data MAY be followed by a "data descriptor" for the file.
Data descriptors are used to facilitate ZIP file streaming.
There are some possible limitations to be aware of though. Not every tool can read streaming ZIP files. Also, support for Zip64 streaming ZIP files may have even less support but that's only of concern for files over 2GB with this class. However, both 7-Zip and the Windows 10 built-in ZIP file reader seem to be fine with handling all of crazy files that the ZipStreamWriter class threw at them. The hex editor I use got a good workout too.
When using the ZipStreamWriter class, I recommend allowing a buffer to build up to at least 4KB but no more than 65KB at a time before sending it on to the web server. Otherwise, for lots of really tiny files, you'll be flushing out tiny bits of piecemeal data and waste a bunch of extra CPU cycles on the Apache callback end of things.
When something doesn't exist or I don't like the existing options, I find both official and unofficial specifications, some examples to work with, and then I build it from scratch. It's a fairly solid approach to problem solving, if just a tad overkill.

php most memory efficient way to return files

so i have a bunch of files, some can be up to 30-40mb
and i want to use php to handle security of the files, so i can control who has access to them
that means i have a script sort of like this rough example
$has_permission = check_database_for_permission($user, filename);
if ($has_permission) {
header('Content-Type: image/jpeg');
readfile ($filename);
exit;
} else {
// return 401 error
}
i would hate for every request to load the full file into memory, as it would soon chew up all the memory on my server with a few simultaneous requests
so a couple of questions
is readfile the most memory efficient way of doing this?
is there some better method of achieving the same outcome, that i am overlooking?
server: apache/php5
thanks

readfile is the correct way to do this. By all means don't try to read the file yourself and print it to output--that will consume excessive memory. With the readfile function the contents of the file are buffered directly to output, taking up a trivial amount of transitory memory.

The fastest way is when you can relay this to the webserver. The webserver can use the sendfile() call to ask the operating system kernel to directly copy from a file to the network stream.
for instance when using lighttpd there is a way that PHP can signal the server to take over and do the sendfile trick:
http://redmine.lighttpd.net/projects/lighttpd/wiki/X-LIGHTTPD-send-file

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.