Why does readfile() exhaust PHP memory? - php

I've seen many questions about how to efficiently use PHP to download files rather than allowing direct HTTP requests (to keep files secure, to track downloads, etc.).
The answer is almost always PHP readfile().
Downloading large files reliably in PHP
How to force download of big files without using too much memory?
Best way to transparently log downloads?
BUT, although it works great during testing with huge files, when it's on a live site with hundreds of users, downloads start to hang and PHP memory limits are exhausted.
So what is it about how readfile() works that causes memory to blow up so bad when traffic is high? I thought it's supposed to bypass heavy use of PHP memory by writing directly to the output buffer?
EDIT: (To clarify, I'm looking for a "why", not "what can I do". I think that Apache's mod_xsendfile is the best way to circumvent)

Description
int readfile ( string $filename [, bool $use_include_path = false [, resource $context ]] )
Reads a file and writes it to the output buffer.
PHP has to read the file, and it writes what it reads to the output buffer.
So, for a 300 MB file, no matter what implementation you write (many small segments or one big chunk), PHP eventually has to read through all 300 MB of the file.
If multiple users download the file at the same time, there will be a problem.
(On a shared server, hosting providers limit the memory given to each hosting user. With such limited memory, using a buffer is not going to be a good idea.)
I think using the direct link to download a file is a much better approach for big files.

If you have output buffering on, then call ob_end_flush() right before the call to readfile():
header(...);
ob_end_flush();
@readfile($file);

As mentioned here: "Allowed memory .. exhausted" when using readfile, the following block of code at the top of the php file did the trick for me.
This checks whether PHP output buffering is active. If so, it turns it off.
if (ob_get_level()) {
ob_end_clean();
}
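Note that ob_end_clean() closes only the innermost buffer, so the check above leaves outer buffers active if several are nested. A minimal sketch that closes every level before streaming; the function name closeAllOutputBuffers and the $file variable in the usage comment are illustrative, not taken from the answer above:

```php
<?php
// Hypothetical helper: discard and close every active output buffer so that
// readfile() can stream to the client instead of accumulating in memory.
function closeAllOutputBuffers(): int
{
    $closed = 0;
    while (ob_get_level() > 0) {
        // ob_end_clean() returns false if the buffer cannot be removed;
        // stop in that case to avoid looping forever.
        if (!@ob_end_clean()) {
            break;
        }
        $closed++;
    }
    return $closed;
}

// Usage (assumes $file holds the path of the file to send):
// closeAllOutputBuffers();
// readfile($file);
```

The return value is only there to make it easy to log how many buffer levels were open when the download started.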

You might want to turn off output buffering altogether for that particular location, using PHP's output_buffering configuration directive.
Apache example:
<Directory "/your/downloadable/files">
...
php_admin_value output_buffering "0"
...
</Directory>
"Off" as the value seems to work as well, while it really should throw an error. At least according to how other types are converted to booleans in PHP. *shrugs*

Came up with this idea in the past (as part of my library) to avoid high memory usage:
function suTunnelStream( $sUrl, $sMimeType, $sCharType = null )
{
    $f = @fopen( $sUrl, 'rb' );
    if( $f === false )
    { return false; }
    $b = false;
    $u = true;
    while( $u !== false && !feof($f ))
    {
        $u = @fread( $f, 1024 );
        if( $u !== false )
        {
            if( !$b )
            { $b = true;
                suClearOutputBuffers();
                suCachedHeader( 0, $sMimeType, $sCharType, null, !suIsValidString($sCharType)?('content-disposition: attachment; filename="'.suUniqueId($sUrl).'"'):null );
            }
            echo $u;
        }
    }
    @fclose( $f );
    return ( $b && $u !== false );
}
Maybe this can give you some inspiration.

Well, it is a memory-intensive function. I would send users to a static server that has a specific rule set in place to control downloads instead of using readfile().
If that's not an option, add more RAM to handle the load or introduce a queuing system that gracefully controls server usage.

Related

creating only new files in PHP without cpu intensive code

In my cache system, when a new page is requested, I want to check whether a file exists; if it doesn't, a copy is stored on the server. If it does exist, it must not be overwritten.
The problem I have is that I may be using functions designed to be slow.
This is part of my current implementation to save files:
if (!file_exists($filename)){$h=fopen($filename,"wb");if ($h){fwrite($h,$c);fclose($h);}}
This is part of my implementation to load files:
if (($m=@filemtime($file)) !== false){
if ($m >= filemtime("sitemodification.file")){
$outp=file_get_contents($file);
header("Content-length:".strlen($outp),true);echo $outp;flush();exit();
}
}
What I want to do is replace this with a better set of functions meant for performance and yet still achieve the same functionality. All caching files including sitemodification.file reside on a ramdisk. I added a flush before exit in hopes that content will be outputted faster.
I can't use direct memory addressing at this time because the file sizes to be stored are all different.
Is there a set of functions I can use that can execute the code I provided faster by at least a few milliseconds, especially the loading files code?
I'm trying to keep my time to first byte low.
First, prefer is_file to file_exists and use file_put_contents:
if ( !is_file($filename) ) {
file_put_contents($filename,$c);
}
Then, use the proper function for this kind of work, readfile:
if ( ($m = @filemtime($file)) !== false && $m >= filemtime('sitemodification.file')) {
    header('Content-length:'.filesize($file));
    readfile($file);
}
You should see a small improvement, but keep in mind that file accesses are slow, and you still touch the filesystem three times (two filemtime() calls and one filesize() call) before sending any content.
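The check-then-write pattern above can also race: two requests may both see the file missing and both write it. One way to fold the existence check and the creation into a single call is fopen()'s 'x' mode, which fails if the file already exists. A rough sketch, where writeIfAbsent is a made-up name:

```php
<?php
// Hypothetical helper: create $filename with $contents only if it does not
// exist yet. Mode 'x' makes fopen() fail when the file is already there, so
// no separate is_file()/file_exists() call is needed.
function writeIfAbsent(string $filename, string $contents): bool
{
    $handle = @fopen($filename, 'xb');
    if ($handle === false) {
        return false; // already exists (or the directory is not writable)
    }
    $ok = fwrite($handle, $contents) !== false;
    fclose($handle);
    return $ok;
}

// Usage (assumes $filename and $c as in the question):
// writeIfAbsent($filename, $c);
```

Whether this is measurably faster on a ramdisk is something to profile rather than assume; the main gain is that an existing file is never overwritten.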

How to avoid a possible missing cache file in PHP?

I have a simple caching system as
if (file_exists($cache)) {
echo file_get_contents($cache);
// if coming here when $cache is deleting, then nothing to display
}
else {
// PHP process
}
We regularly delete outdated cache files, e.g. deleting all caches after 1 hour. Although this process is very fast, I am thinking that a cache file could be deleted right between the if statement and the file_get_contents call.
I mean when if statement checks the existence of cache file, it exists; but when file_get_contents tries to catch it, it is no longer there (deleted by simultaneous cache deleting process).
file_get_contents locks the file to avoid the undergoing delete process during the read process. But the file can be deleted when the if statement sends the PHP process to the first condition (before start of the file_get_contents).
Is there any approach to avoid this? Is the cache deleting system different?
NOTE: I did not face any practical problem, as it is not very probable to catch this event, but logically it is possible, and should happen on heavy loads.
Luckily, file_get_contents returns FALSE on error, so you could quick-bake it like:
if (FALSE !== ($buffer = file_get_contents($cache))) {
    echo $buffer;
    return;
}
// PHP process
or similar. It's a bit of a quick and dirty way, considering you will want to add the @ operator to hide any warnings about non-existent files:
if (FALSE !== ($buffer = @file_get_contents($cache))) {
The other alternative would be to lock the file; however, that might prevent your cache-deletion process from deleting the file while you hold the lock.
What is left then is to expire the cache on your own. That means reading the file-creation time in PHP and checking that the file is younger than the deletion interval by some margin, say 5 minutes (5 minutes is just an example). If it is not, you know the file is already stale and due to be replaced with fresh content, so re-create the file. Otherwise read the file in, which is probably better done with readfile than with file_get_contents and echo.
On failure, file_get_contents returns false, so what about this:
if (($output = file_get_contents($filename)) === false){
// Do the processing.
$output = 'Generated content';
// Save cache file
file_put_contents($filename, $output);
}
echo $output;
By the way, you may want to consider using fpassthru, which is more memory-efficient, especially for larger files. Using file_get_contents on large files (> 100 MB) will probably cause problems (depending on your configuration).
<?php
$fp = @fopen($filename, 'rb');
if ($fp === false){
// Generate output
} else {
fpassthru($fp);
}
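Putting these answers together: opening the file directly removes the gap between the existence check and the read. A sketch under that idea; serveCached and the $generate callback are hypothetical names, and writing through a temporary file plus rename() is an addition that is not part of the answers above (it assumes the temporary file is on the same filesystem as the cache file):

```php
<?php
// Hypothetical helper: stream the cache file if it can be opened, otherwise
// generate the content, store it, and echo it.
function serveCached(string $cacheFile, callable $generate): void
{
    $fp = @fopen($cacheFile, 'rb');
    if ($fp !== false) {
        fpassthru($fp); // streams the rest of the file to the output
        fclose($fp);
        return;
    }

    $output = $generate();

    // Write to a temporary file next to the cache file, then rename it into
    // place, so readers should not see a half-written cache file.
    $tmp = $cacheFile . '.' . uniqid('', true) . '.tmp';
    if (file_put_contents($tmp, $output) !== false) {
        rename($tmp, $cacheFile);
    }
    echo $output;
}

// Usage (assumes $cache as in the question):
// serveCached($cache, function () { return 'Generated content'; });
```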

Running concurrent PHP scripts

I'm having the following problem with my VPS server.
I have a long-running PHP script that sends big files to the browser. It does something like this:
<?php
header("Content-type: application/octet-stream");
readfile("really-big-file.zip");
exit();
?>
This basically reads the file from the server's file system and sends it to the browser. I can't just use direct links(and let Apache serve the file) because there is business logic in the application that needs to be applied.
The problem is that while such download is running, the site doesn't respond to other requests.
The problem you are experiencing is related to the fact that you are using sessions. When a script has a running session, it locks the session file to prevent concurrent writes which may corrupt the session data. This means that multiple requests from the same client - using the same session ID - will not be executed concurrently, they will be queued and can only execute one at a time.
Multiple users will not experience this issue, as they will use different session IDs. This does not mean that you don't have a problem, because you may conceivably want to access the site whilst a file is downloading, or set multiple files downloading at once.
The solution is actually very simple: call session_write_close() before you start to output the file. This will close the session file, release the lock and allow further concurrent requests to execute.
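As a rough illustration of that ordering, here is a sketch of a download helper that releases the session lock before streaming. The function name sendDownload is invented for this example, and the headers are the bare minimum:

```php
<?php
// Hypothetical download helper: release the session lock, then stream.
function sendDownload(string $path): bool
{
    if (!is_file($path)) {
        return false;
    }
    // Release the session lock first so that other requests from the same
    // client are not blocked for the duration of the transfer.
    if (session_status() === PHP_SESSION_ACTIVE) {
        session_write_close();
    }
    if (!headers_sent()) {
        header('Content-Type: application/octet-stream');
        header('Content-Length: ' . filesize($path));
    }
    return readfile($path) !== false;
}

// Usage, after session_start() and the business logic:
// sendDownload('really-big-file.zip');
// exit();
```

Values already read from $_SESSION stay available in the script after session_write_close(); only later changes to the session are no longer saved.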
Your server setup is probably not the only place you should be checking.
Try doing a request from your browser as usual and then do another from some other client.
Either wget from the same machine or another browser on a different machine.
In what way doesn't the server respond to other requests? Is it "Waiting for example.com..." or does it give an error of any kind?
I do something similar, but I serve the file in chunks. That gives the file system a break while the client accepts and downloads each chunk, which is better than offering up the entire thing at once; doing so is pretty demanding on the file system and the entire server.
EDIT: While not the answer to this question, asker asked about reading a file chunked. Here's the function that I use. Supply it the full path to the file.
function readfile_chunked($file_path, $retbytes = true)
{
$buffer = '';
$cnt = 0;
$chunksize = 1 * (1024 * 1024); // 1 = 1MB chunk size
$handle = fopen($file_path, 'rb');
if ($handle === false) {
return false;
}
while (!feof($handle)) {
$buffer = fread($handle, $chunksize);
echo $buffer;
if (ob_get_level() > 0) { ob_flush(); } // ob_flush() raises a notice when no buffer is active
flush();
if ($retbytes) {
$cnt += strlen($buffer);
}
}
$status = fclose($handle);
if ($retbytes && $status) {
return $cnt; // return num. bytes delivered like readfile() does.
}
return $status;
}
I have tried different approaches (reading and sending the files in small chunks [see comments on readfile in PHP doc], using PEAR's HTTP_Download) but I always ran into performance problems when the files got big.
There is an Apache mod X-Sendfile where you can do your business logic and then delegate the download to Apache. The download will not be publicly available. I think, this is the most elegant solution for the problem.
More Info:
http://tn123.org/mod_xsendfile/
http://www.brighterlamp.com/2010/10/send-files-faster-better-with-php-mod_xsendfile/
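For illustration, the PHP side of this approach is small: the script runs its checks and then sends an X-Sendfile header with the file's path, and Apache takes over the transfer. This sketch assumes mod_xsendfile is installed and enabled with "XSendFile On" for the location, and that the path is one the module is configured to allow; the helper name and the example path are invented:

```php
<?php
// Hypothetical helper: build the response headers that hand the transfer
// over to Apache's mod_xsendfile.
function xsendfileHeaders(string $path, string $downloadName): array
{
    // Strip characters that would break the quoted filename parameter.
    $safeName = str_replace(['"', "\r", "\n", '\\'], '', basename($downloadName));
    return [
        'X-Sendfile: ' . $path,
        'Content-Type: application/octet-stream',
        'Content-Disposition: attachment; filename="' . $safeName . '"',
    ];
}

// Usage, after the business logic has approved the download
// (the path below is only an example):
// foreach (xsendfileHeaders('/srv/files/report.zip', 'report.zip') as $line) {
//     header($line);
// }
// exit;
```

Because PHP never reads the file itself, the script's memory use does not depend on the file size.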
The same happens to me and I'm not using sessions.
session.auto_start is set to 0
My example script only runs "sleep(5)", and adding "session_write_close()" at the beginning doesn't solve the problem.
Check your httpd.conf file. Maybe you have "KeepAlive On" and that is why your second request hangs until the first is completed. In general, your PHP script should not keep visitors waiting for a long time. If you need to download something big, do it in a separate internal request that the user has no direct control over. Until it's done, return an "executing" status to the end user, and when it's done, process the actual results.

PHP Script takes almost 15 seconds to load

I've written a PHP script that iterates through a given folder, extracts all the images from there and displays them on an HTML page (as tags). The size of the page is about 14 KB, but it takes the page almost 15 seconds to load.
Here's the code:
function displayGallery( $gallery, $rel, $first_image ) {
$prefix = "http://www.example.com/";
$path_to_gallery = "gallery_albums/" . $gallery . "/";
$handler = opendir( $path_to_gallery ); //opens directory
while ( ( $file = readdir( $handler ) ) !== false ) {
if ( strcmp( $file, "." ) != 0 && strcmp( $file, ".." ) !=0 ) {
//check for "." and ".." files
if ( isImage( $prefix . $path_to_gallery . $file ) ) {
echo '';
}
}
}
closedir( $handler ); //closes directory
}
function isImage($image_file) {
if (getimagesize($image_file)!==false) {
return true;
} else {
return false;
}
}
I looked at other posts, but most of them deal with SQL queries, and that's not my case.
Any suggestions how to optimize this?
You can use a PHP profiler like http://xdebug.org/docs/profiler to find what part of the script is taking forever to run. It might be overkill for this issue, but long-term you may be glad you took the time now to set it up.
I suppose that's because you've added $prefix in the isImage invocation. That way this function actually downloads all your images directly from your webserver instead of looking them up locally.
You may use getimagesize(); it issues E_NOTICE and returns FALSE when the file is not a known image type.
An out of left field suggestion here. You don't state how you are clocking the execution time. If you are clocking it in the browser, as in taking 15 seconds to load the page from a link, the problem could have nothing at all to do with your script. I have seen people in the past create similar pages trying to use images as tags, and they take forever to load because even though they are displaying the image at thumbnail size or smaller, the image itself is still 800 x 600 or something. I know it sounds daft, but make sure that you are not just displaying large images in a small size. It would be perfectly reasonable for a script to require 15 seconds to load and display 76 800 x 600 jpegs.
My assumption is that isImage is the problem. I've never seen it before. Why not just check for particular file extensions? That's pretty quick.
Update: You might also try switching to exif_imagetype(), which is likely faster than getimagesize(). Putting that check into the top function is also going to be faster. Neither of those functions was meant to be used over a web connection, so avoid that altogether. Best to stick with the file extension.
Do you not already have access to the files directly? Every time you look something up over the web, it's going to take a while - you need to wait for the entire file to download. Look up the files directly on your system.
Use scandir to get all the filenames at once into an array and walk through them. That will likely speed things up as I assume there won't be a back and forth to get things individually.
Instead of doing strcmp for . and .. just do $file != '.' && $file != '..'
Also, the speed is going to depend on the number of files being returned, if there are a lot it's going to be slow. The OS can slow down with too many files in a directory as well. You're looping over all files and directories, not just images so that's the number that counts, not just the images.
getimagesize is the problem, it took 99.1% of the script time.
Version #1 - Original case
Version #2 - If you really need to use getimagesize with a URL (http://), there is a faster alternative, found in http://www.php.net/manual/en/function.getimagesize.php#88793 . It reads only the first X bytes of the image. XHProf shows it is 10x faster. Another idea could be using curl multi for parallel downloads: https://stackoverflow.com/search?q=getimagesize+alternative
Version #3 - I think the most suitable for your case is to open the files as normal file system paths, without (http://). This is even 100x faster per XHProf.
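Combining several of the suggestions above (local paths instead of URLs, scandir, and an extension check instead of getimagesize), a sketch could look like this. The function name, the extension list, and the directory in the usage comment are assumptions for illustration:

```php
<?php
// Hypothetical helper: list image files in a local directory by extension,
// without opening any of them and without going through HTTP.
function listImages(string $dir, array $extensions = ['jpg', 'jpeg', 'png', 'gif']): array
{
    $names = @scandir($dir);
    if ($names === false) {
        return [];
    }
    $images = [];
    foreach ($names as $name) {
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
        if (in_array($ext, $extensions, true) && is_file($dir . '/' . $name)) {
            $images[] = $name;
        }
    }
    return $images;
}

// Usage (directory name taken from the question's layout):
// foreach (listImages('gallery_albums/' . $gallery) as $file) { ... }
```

An extension check trusts the file name; if the folder can contain mislabeled files, a local exif_imagetype() call on each path is the stricter option.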

How can I optimize this simple PHP script?

This first script gets called several times for each user via an AJAX request. It calls another script on a different server to get the last line of a text file. It works fine, but I think there is a lot of room for improvement but I am not a very good PHP coder, so I am hoping with the help of the community I can optimize this for speed and efficiency:
AJAX POST Request made to this script
<?php session_start();
$fileName = $_POST['textFile'];
$result = file_get_contents($_SESSION['serverURL']."fileReader.php?textFile=$fileName");
echo $result;
?>
It makes a GET request to this external script which reads a text file
<?php
$fileName = $_GET['textFile'];
if (file_exists('text/'.$fileName.'.txt')) {
$lines = file('text/'.$fileName.'.txt');
echo $lines[sizeof($lines)-1];
}
else{
echo 0;
}
?>
I would appreciate any help. I think there is more improvement that can be made in the first script. It makes an expensive function call (file_get_contents), well at least I think its expensive!
This script should limit the locations and file types that it's going to return.
Think of somebody trying this:
http://www.yoursite.com/yourscript.php?textFile=../../../etc/passwd (or something similar)
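One possible way to do that is to accept only names made of a small set of characters and to build the path from a fixed directory. A minimal sketch; the function name resolveTextFile and the allowed character set are assumptions, not something from the question:

```php
<?php
// Hypothetical helper: map a user-supplied name to a .txt file inside
// $baseDir, or return null if the name is not acceptable.
function resolveTextFile(string $baseDir, string $name): ?string
{
    // Allow only letters, digits, underscore and hyphen: no dots or slashes,
    // so "../" sequences cannot get through. \A and \z anchor the whole
    // string (a plain $ would also accept a trailing newline).
    if (preg_match('/\A[A-Za-z0-9_-]+\z/', $name) !== 1) {
        return null;
    }
    $path = rtrim($baseDir, '/') . '/' . $name . '.txt';
    return is_file($path) ? $path : null;
}

// Usage in the second script:
// $path = resolveTextFile('text', $_GET['textFile'] ?? '');
// if ($path === null) { echo 0; exit; }
```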
Try to find out where the delays occur: does the HTTP request take long, or is the file so large that reading it takes long?
If the request is slow, try caching results locally.
If the file is huge, then you could set up a cron job that extracts the last line of the file at regular intervals (or at every change), and save that to a file that your other script can access directly.
readfile is your friend here: it reads a file on disk and streams it to the client.
script 1:
<?php
session_start();
// added basic argument filtering
$fileName = preg_replace('/[^A-Za-z0-9_]/', '', $_POST['textFile']);
$fileName = $_SESSION['serverURL'].'text/'.$fileName.'.txt';
if (file_exists($fileName)) {
// script 2 could be pasted here
//for the entire file
//readfile($fileName);
//for just the last line
$lines = file($fileName);
echo $lines[count($lines)-1];
exit(0);
}
echo 0;
?>
This script could further be improved by adding caching to it. But that is more complicated.
The very basic caching could be.
script 2:
<?php
$lastModifiedTimeStamp = filemtime($fileName);
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
$browserCachedCopyTimestamp = strtotime(preg_replace('/;.*$/', '', $_SERVER['HTTP_IF_MODIFIED_SINCE']));
if ($browserCachedCopyTimestamp >= $lastModifiedTimeStamp) {
header("HTTP/1.0 304 Not Modified");
exit(0);
}
}
header('Content-Length: '.filesize($fileName));
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 604800)); // (3600 * 24 * 7)
header('Last-Modified: '.gmdate('D, d M Y H:i:s \G\M\T', $lastModifiedTimeStamp));
?>
First things first: Do you really need to optimize that? Is that the slowest part in your use case? Have you used xdebug to verify that? If you've done that, read on:
You cannot really optimize the first script usefully: If you need a http-request, you need a http-request. Skipping the http request could be a performance gain, though, if it is possible (i.e. if the first script can access the same files the second script would operate on).
As for the second script: Reading the whole file into memory does look like some overhead, but that is negligible if the files are small. The code looks very readable, so I would leave it as is in that case.
If your files are big, however, you might want to use fopen() and its friends fseek() and fread():
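# Do not forget to sanitize the file name here!
# An attacker could demand the last line of your password
# file or similar! ($fileName = '../../passwords.txt')
function lastLine($fileName, $chunkSize = 200)
{
    $filePointer = @fopen($fileName, 'rb');
    if ($filePointer === false) {
        return false;
    }
    $fileSize = fstat($filePointer)['size'];
    $i = 1;
    # Read a growing tail of the file (200, 400, ... bytes) until it
    # contains a newline or covers the whole file
    do {
        $length = max(1, min($i++ * $chunkSize, $fileSize));
        fseek($filePointer, -$length, SEEK_END);
        # Ignore the newline that terminates the last line itself
        $line = rtrim((string) fread($filePointer, $length), "\r\n");
        $pos = strrpos($line, "\n");
    } while ($pos === false && $length < $fileSize);
    fclose($filePointer);
    return $pos === false ? $line : substr($line, $pos + 1);
}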
If the files are unchanging, you should cache the last line.
If the files are changing and you control the way they are produced, it might or might not be an improvement to reverse the order lines are written, depending on how often a line is read over its lifetime.
Edit:
Your server could figure out what it wants to write to its log, put it in memcache, and then write it to the log. The request for the last line could be fulfilled from memcache instead of a file read.
The most probable source of delay is that cross-server HTTP request. If the files are small, the cost of fopen/fread/fclose is nothing compared to the whole HTTP request.
(Not long ago I used HTTP to retrieve images to dynamically generate image-based menus. Replacing the HTTP request by a local file read reduced the delay from seconds to tenths of a second.)
I assume that the obvious solution of accessing the file server's filesystem directly is out of the question. If it isn't, then that's the best and simplest option.
Otherwise, you could use caching. Instead of getting the whole file, you just issue a HEAD request and compare the timestamp to a local copy.
Also, if you are ajax-updating a lot of clients based on the same files, you might consider looking at using comet (meteor, for example). It's used for things like chats, where a single change has to be broadcasted to several clients.
