How to copy very large files from URL to server via PHP? - php

I use the following code to copy/download files from an external server (any server, via a URL) to my hosted web server (Dreamhost shared hosting at default settings).
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<form method="post" action="copy.php">
<input type="submit" value="click" name="submit">
</form>
</body>
</html>
<!-- copy.php file contents -->
<?php
function chunked_copy() {
    # 1 meg at a time, adjustable.
    $buffer_size = 1048576;
    $ret = 0;
    $fin = fopen("http://www.example.com/file.zip", "rb");
    $fout = fopen("file.zip", "w");
    while (!feof($fin)) {
        $ret += fwrite($fout, fread($fin, $buffer_size));
    }
    fclose($fin);
    fclose($fout);
    return $ret; # return number of bytes written
}

if (isset($_POST['submit'])) {
    chunked_copy();
}
?>
However, the function stops running once about 2.5GB of the file has downloaded (sometimes 2.3GB, sometimes 2.7GB, etc.). This happens every time I execute this function. Smaller files (<2GB) rarely exhibit this problem. I believe nothing is wrong with the source, as I separately downloaded the file flawlessly onto my home PC.
Can someone please remedy and explain this problem to me? I am very new to programming.
Also,
file_put_contents("Tmpfile.zip", fopen("http://example.com/file.zip", 'r'));
exhibits similar symptoms as well.

I think the problem might be the 30-second time-out on many servers running PHP scripts.
PHP scripts running via cron or a shell won't have that problem, so perhaps you could find a way to do it that way.
Alternatively, you could add set_time_limit([desired time]) to the start of your code.
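For example, a minimal sketch of how the top of copy.php could look with the limit lifted (set_time_limit(0) removes the limit entirely; note that some hosts ignore or forbid changing it):
<?php
// Lift the script execution time limit before starting the long copy.
// This has no effect if the host disallows changing the limit.
set_time_limit(0);

if (isset($_POST['submit'])) {
    chunked_copy(); // the function from the question
}
?>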

Explain: perhaps. Remedy: probably not.
It may be caused by the limits of PHP: the manual on the filesize function mentions in the section on the return value:
Note: Because PHP's integer type is signed and many platforms use 32bit integers, some filesystem functions may return unexpected results for files which are larger than 2GB.
It seems that the fopen function may cause the issue, as two comments (1, 2) were added (although modded down) on the subject.
It appears as if you need to compile PHP from source (with the CFLAGS="-D_FILE_OFFSET_BITS=64" flag) to enable large files (>2GB), but it might break some other functionality.
Since you're using shared hosting, I guess you're out of luck.
Sorry...
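As a quick sanity check (just a sketch), you can inspect the integer size of your PHP build; a 32-bit build reports PHP_INT_SIZE = 4 and PHP_INT_MAX = 2147483647, which lines up with the roughly 2-2.5GB point where your copy stops:
<?php
// On a 32-bit build this prints int(4) and int(2147483647) (~2GB).
// On a 64-bit build it prints int(8) and int(9223372036854775807).
var_dump(PHP_INT_SIZE, PHP_INT_MAX);
?>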

Since the problem occurs at an (as yet) unknown and undefined file size, perhaps it is best to try a work-around. What if you just close and then re-open the output file after some number of bytes?
function chunked_copy() {
    # 1 meg at a time, adjustable.
    $buffer_size = 1048576;
    # 1 GB write-chunks
    $write_chunks = 1073741824;
    $ret = 0;
    $fin = fopen("http://www.example.com/file.zip", "rb");
    $fout = fopen("file.zip", "w");
    $bytes_written = 0;
    while (!feof($fin)) {
        $bytes = fwrite($fout, fread($fin, $buffer_size));
        $ret += $bytes;
        $bytes_written += $bytes;
        if ($bytes_written >= $write_chunks) {
            // (another) chunk of 1GB has been written; close and reopen the stream
            fclose($fout);
            $fout = fopen("file.zip", "a"); // "a" for "append"
            $bytes_written = 0; // re-start counting
        }
    }
    fclose($fin);
    fclose($fout);
    return $ret; # return number of bytes written
}
The re-opening should be done in append mode, which places the write pointer (there is no read pointer) at the end of the file, so bytes written earlier are not overwritten.
This will not solve any Operating System-level or File System-level issues, but it may solve any counting issue internal to PHP while writing to files.
Perhaps this trick can (or should) also be applied on the reading end, but I'm not sure if you can perform seek operations on a download...
Note that any integer overflows (beyond 2147483647 if you're on 32-bit) should be transparently solved by casting to float, so that should not be an issue.
Edit: count the actual number of bytes written, not the chunk size

Maybe you can try cURL to download the file.
function downloadUrlToFile($url, $outFileName)
{
    //file_put_contents($xmlFileName, fopen($link, 'r'));
    //copy($link, $xmlFileName); // download xml file
    if (is_file($url)) {
        copy($url, $outFileName); // local path: a plain copy is enough
    } else {
        $fp = fopen($outFileName, 'w');
        $options = array(
            CURLOPT_FILE    => $fp,
            CURLOPT_TIMEOUT => 28800, // set this to 8 hours so we don't time out on big files
            CURLOPT_URL     => $url
        );
        $ch = curl_init();
        curl_setopt_array($ch, $options);
        curl_exec($ch);
        curl_close($ch);
        fclose($fp); // close the target file so all buffered data is flushed to disk
    }
}
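Hypothetical usage (the URL and output filename are placeholders):
downloadUrlToFile('http://www.example.com/file.zip', 'file.zip');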

You get a time-out after 30s, probably caused by PHP (with default max_execution_time = 30s). You could try setting it to a larger time:
ini_set('max_execution_time', '300');
However, there are some caveats:
If the script is running in safe mode, you cannot set max_execution_time with ini_set (I could not find whether Dreamhost has safe mode on or off in shared hosting; you would need to ask them, or just try it).
The web server may have an execution limit as well. Apache defaults this to 300s (IIS as well, but given that Dreamhost provides a 'full unix shell', Apache is more likely than IIS). But with a file size of 5GB, this should help you out.

This is the best way I found for downloading very large files: fast, and it doesn't need a lot of memory.
public function download_large_file(string $url, string $dest)
{
    // stream_copy_to_stream() copies in chunks, so little memory is actually
    // needed; raising the limit does no harm on hosts that allow it.
    ini_set('memory_limit', '3000M');
    ini_set('max_execution_time', '0');
    try {
        $handle1 = fopen($url, 'r');
        $handle2 = fopen($dest, 'w');
        if ($handle1 === false || $handle2 === false) {
            return false; // fopen() returns false (with a warning) on failure
        }
        stream_copy_to_stream($handle1, $handle2);
        fclose($handle1);
        fclose($handle2);
        return true;
    } catch (\Exception $e) {
        return $e->getMessage();
    }
}

Related

MAMP strange behaviour: PHP reading an external file over http:// is very slow, but over https:// it is quick

I have a simple PHP script that reads a remote file line-by-line and then JSON-decodes it. On the production server all works OK, but on my local machine (MAMP stack, OSX) PHP hangs. It is very slow, and takes more than 2 minutes to produce the JSON file. I think it's the json_decode() that is freezing. Why only on MAMP?
I think it's stuck in the while loop, because I can't print the final $str variable that should hold all the lines.
In case you are wondering why I need to read the file line-by-line, it's because in the real scenario the remote JSON file is a 40MB text file. Reading it line-by-line is the only approach that gave me decent performance, but I'm open to any good suggestion.
Is there a configuration in php.ini to help solve this?
// The path to the JSON file
$fileName = 'http://www.xxxx.xxx/response-single.json';

// Open the file in "read only" mode.
$fileHandle = fopen($fileName, "r");

// If we failed to get a file handle, throw an Exception.
if ($fileHandle === false) {
    error_log("error handle");
    throw new Exception('Could not get file handle for: ' . $fileName);
}

// While we haven't reached the end of the file...
$str = "";
while (!feof($fileHandle)) {
    // Read the current line in.
    $line = fgets($fileHandle);
    $str .= $line;
}

// Finally, close the file handle.
fclose($fileHandle);

$json = json_decode($str, true); // decode the JSON into an associative array
Thanks for your time.
I found the cause: it is the protocol in the file path.
With
$filename = 'http://www.yyy/response.json';
It freezes the server for 1 to 2 minutes.
I moved the file to another server that uses the https protocol, and used
$filename = 'https://www.yyy/response.json';
and it works.
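If you cannot move the file, another option (a sketch only, with a placeholder URL) is to set a read timeout on the HTTP stream so a slow connection fails quickly instead of hanging for minutes:
// Fail after 10 seconds instead of hanging for 1-2 minutes on a slow connection.
$context = stream_context_create(array(
    'http' => array('timeout' => 10),
));

$str = file_get_contents('http://www.example.com/response-single.json', false, $context);
if ($str === false) {
    throw new Exception('Could not read the remote JSON file');
}
$json = json_decode($str, true);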

PHP Persist variable across all requests

In some languages, such as C# or .NET, this would be a static variable, but in PHP the memory is cleared after each request. I want the value to persist across all requests. I don't want $_SESSION because that is different for each user.
To help explain here is an example:
I want to have a script like this that will count up. No matter which user/browser opens the url.
<?php
function getServerVar($name){
...
}
function setServerVar($name,$val){
...
}
$count = getServerVar("count");
$count++;
setServerVar("count", $count);
echo $count;
I want the value stored in memory. It will not be something that needs to persist when apache restarts and the data is not that important that it needs to be thread safe.
UPDATE: I'm fine if it holds different values per server in a loadbalanced environment. Static variables in C# or Java will not be in sync either.
You would typically use a database to store the count.
However as an alternative you could do so using a file:
<?php
$file = 'count.txt';

if (!file_exists($file)) {
    touch($file);
}

// Open the file stream
$handle = fopen($file, "r+");

// Lock the file; error if unable to lock
if (flock($handle, LOCK_EX)) {
    $size = filesize($file);
    $count = $size === 0 ? 0 : fread($handle, $size); // get current hit count
    $count = $count + 1; // increment hit count by 1
    echo $count;
    ftruncate($handle, 0); // truncate the file to 0
    rewind($handle); // set write pointer to beginning of file
    fwrite($handle, $count); // write the new hit count
    flock($handle, LOCK_UN); // unlock file
} else {
    echo "Could not lock file!";
}

// Close the stream
fclose($handle);
In PHP you're going to have to use an external store that all servers share. The most commonly used tool is Memcached, but SQL and Redis both work fine for this use case.
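For example, a minimal sketch using the Memcached extension (the server address and key name are placeholders; error handling is omitted):
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Create the counter if it doesn't exist yet, then increment it atomically.
$mc->add('count', 0);             // no-op if the key already exists
$count = $mc->increment('count');

echo $count;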
The only way to do that is, like bspates said, to use a tool that does not depend on resources local to any one of your servers. If you have multiple servers, you cannot rely on memory-based mechanisms on each machine.
You have to store this number outside the servers, because otherwise each server will keep its own value in its own file or memory.
Writing to a file, like $_SESSION, will work if you have only one server receiving your requests. For more than one server, you need some kind of database that all your servers communicate with.

PHP - Chunked file copy (via FTP) has missing bytes?

So, I'm writing a chunked file transfer script that is intended to copy files--small and large--to a remote server. It almost works fantastically (and did with a 26 byte file I tested, haha) but when I start to do larger files, I notice it isn't quite working. For example, I uploaded a 96,489,231 byte file, but the final file was 95,504,152 bytes. I tested it with a 928,670,754 byte file, and the copied file only had 927,902,792 bytes.
Has anyone else ever experienced this? I'm guessing feof() may be doing something wonky, but I have no idea how to replace it, or test that. I commented the code, for your convenience. :)
<?php
// FTP credentials
$server = CENSORED;
$username = CENSORED;
$password = CENSORED;
// Destination file (where the copied file should go)
$destination = "ftp://$username:$password@$server/ftp/final.mp4";
// The file on my server that we're copying (in chunks) to $destination.
$read = 'grr.mp4';
// If the file we're trying to copy exists...
if (file_exists($read))
{
    // Set a chunk size
    $chunk_size = 4194304;
    // For reading through the file we want to copy to the FTP server.
    $read_handle = fopen($read, 'rb');
    // For appending to the destination file.
    $destination_handle = fopen($destination, 'ab');
    echo '<span style="font-size:20px;">';
    echo 'Uploading.....';
    // Loop through $read until we reach the end of the file.
    while (!feof($read_handle))
    {
        // So Rackspace doesn't think nothing's happening.
        echo PHP_EOL;
        flush();
        // Read a chunk of the file we're copying.
        $chunk = fread($read_handle, $chunk_size);
        // Write the chunk to the destination file.
        fwrite($destination_handle, $chunk);
        sleep(1);
    }
    echo 'Done!';
    echo '</span>';
}
fclose($read_handle);
fclose($destination_handle);
?>
EDIT
I (may have) confirmed that the script is dying at the end somehow, and not corrupting the files. I created a simple file with each line corresponding to the line number, up to 10000, then ran my script. It stopped at line 6253. However, the script is still returning "Done!" at the end, so I can't imagine it's a timeout issue. Strange!
EDIT 2
I have confirmed that the problem exists somewhere in fwrite(). By echoing $chunk inside the loop, the complete file is returned without fail. However, the written file still does not match.
EDIT 3
It appears to work if I add sleep(1) immediately after the fwrite(). However, that makes the script take a million years to run. Is it possible that PHP's append has some inherent flaw?
EDIT 4
Alright, I've further isolated the problem to being an FTP problem, somehow. When I run this file copy locally, it works fine. However, when I use the file transfer protocol (line 9), bytes are missing. This occurs despite the binary flags in the two calls to fopen(). What could possibly be causing this?
EDIT 5
I found a fix. The modified code is above--I'll post an answer on my own as soon as I'm able.
I found a fix, though I'm not sure exactly why it works. Simply sleeping after writing each chunk fixes the problem. I upped the chunk size quite a bit to speed things up. Though this is an arguably bad solution, it should work for my uses. Thanks anyway, guys!
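A related, hedged guess: fwrite() on a network stream can perform a short write without reporting an error, so the missing bytes may simply be the unwritten tail of each chunk. Here is a sketch of a write loop that checks the return value and keeps writing until the whole chunk has gone out (it would replace the single fwrite() call in the loop above):
// Write $chunk fully, retrying whenever fwrite() only manages a partial write.
$written = 0;
$length = strlen($chunk);
while ($written < $length) {
    $bytes = fwrite($destination_handle, substr($chunk, $written));
    if ($bytes === false || $bytes === 0) {
        die('Write to the FTP stream failed');
    }
    $written += $bytes;
}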

Running concurrent PHP scripts

I'm having the following problem with my VPS server.
I have a long-running PHP script that sends big files to the browser. It does something like this:
<?php
header("Content-type: application/octet-stream");
readfile("really-big-file.zip");
exit();
?>
This basically reads the file from the server's file system and sends it to the browser. I can't just use direct links (and let Apache serve the file) because there is business logic in the application that needs to be applied.
The problem is that while such download is running, the site doesn't respond to other requests.
The problem you are experiencing is related to the fact that you are using sessions. When a script has a running session, it locks the session file to prevent concurrent writes which might corrupt the session data. This means that multiple requests from the same client - using the same session ID - will not be executed concurrently; they are queued and can only execute one at a time.
Multiple users will not experience this issue, as they will use different session IDs. This does not mean that you don't have a problem, because you may conceivably want to access the site whilst a file is downloading, or set multiple files downloading at once.
The solution is actually very simple: call session_write_close() before you start to output the file. This will close the session file, release the lock and allow further concurrent requests to execute.
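For example, a sketch based on the script in the question, assuming the business logic uses a session (the checks themselves are placeholders):
<?php
session_start();            // business logic that needs the session goes here
// ... authorisation checks, download logging, etc. ...
session_write_close();      // release the session lock before the long download

header("Content-type: application/octet-stream");
readfile("really-big-file.zip");
exit();
?>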
Your server setup is probably not the only place you should be checking.
Try doing a request from your browser as usual and then do another from some other client.
Either wget from the same machine or another browser on a different machine.
In what way doesn't the server respond to other requests? Is it "Waiting for example.com..." or does it give an error of any kind?
I do something similar, but I serve the file in chunks. That gives the file system a break while the client accepts and downloads a chunk, and it is less demanding on the file system and the whole server than offering up the entire file at once.
EDIT: While not the answer to this question, asker asked about reading a file chunked. Here's the function that I use. Supply it the full path to the file.
function readfile_chunked($file_path, $retbytes = true)
{
    $buffer = '';
    $cnt = 0;
    $chunksize = 1 * (1024 * 1024); // 1 = 1MB chunk size
    $handle = fopen($file_path, 'rb');
    if ($handle === false) {
        return false;
    }
    while (!feof($handle)) {
        $buffer = fread($handle, $chunksize);
        echo $buffer;
        ob_flush();
        flush();
        if ($retbytes) {
            $cnt += strlen($buffer);
        }
    }
    $status = fclose($handle);
    if ($retbytes && $status) {
        return $cnt; // return num. bytes delivered like readfile() does.
    }
    return $status;
}
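Hypothetical usage in place of readfile() (the filename is a placeholder):
header("Content-type: application/octet-stream");
readfile_chunked("really-big-file.zip");
exit();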
I have tried different approaches (reading and sending the files in small chunks [see the comments on readfile in the PHP docs], using PEAR's HTTP_Download), but I always ran into performance problems when the files got big.
There is an Apache module, X-Sendfile (mod_xsendfile), that lets you do your business logic in PHP and then delegate the actual download to Apache. The download will not be publicly available. I think this is the most elegant solution for the problem.
More Info:
http://tn123.org/mod_xsendfile/
http://www.brighterlamp.com/2010/10/send-files-faster-better-with-php-mod_xsendfile/
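A minimal sketch of the PHP side once mod_xsendfile is installed and enabled (the path and filename are placeholders):
<?php
// Business logic / access checks go here...

header("X-Sendfile: /path/to/really-big-file.zip");
header("Content-Type: application/octet-stream");
header('Content-Disposition: attachment; filename="really-big-file.zip"');
exit();
?>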
The same happens to me, and I'm not using sessions.
session.auto_start is set to 0.
My example script only runs sleep(5), and adding session_write_close() at the beginning doesn't solve the problem.
Check your httpd.conf file. Maybe you have "KeepAlive On", and that is why your second request hangs until the first is completed. In general, your PHP script should not make visitors wait for a long time. If you need to download something big, do it in a separate internal request that the user has no direct control of. Until it's done, return some "executing" status to the end user, and when it's done, process the actual results.

Why does readfile() exhaust PHP memory?

I've seen many questions about how to efficiently use PHP to download files rather than allowing direct HTTP requests (to keep files secure, to track downloads, etc.).
The answer is almost always PHP readfile().
Downloading large files reliably in PHP
How to force download of big files without using too much memory?
Best way to transparently log downloads?
BUT, although it works great during testing with huge files, when it's on a live site with hundreds of users, downloads start to hang and PHP memory limits are exhausted.
So what is it about how readfile() works that causes memory to blow up so bad when traffic is high? I thought it's supposed to bypass heavy use of PHP memory by writing directly to the output buffer?
EDIT: (To clarify, I'm looking for a "why", not "what can I do". I think that Apache's mod_xsendfile is the best way to circumvent the problem.)
Description
int readfile ( string $filename [, bool $use_include_path = false [, resource $context ]] )
Reads a file and writes it to the output buffer*.
PHP has to read the file and it writes to the output buffer.
So, for 300Mb file, no matter what the implementation you wrote (by many small segments, or by 1 big chunk) PHP has to read through 300Mb of file eventually.
If multiple users have to download the file, there will be a problem.
(On a single server, hosting providers limit the memory given to each hosting user. With such limited memory, using the output buffer is not going to be a good idea.)
I think using the direct link to download a file is a much better approach for big files.
If you have output buffering on, then use ob_end_flush() right before the call to readfile():
header(...);
ob_end_flush();
@readfile($file);
As mentioned here: "Allowed memory .. exhausted" when using readfile, the following block of code at the top of the PHP file did the trick for me.
It checks whether PHP output buffering is active and, if so, turns it off.
if (ob_get_level()) {
    ob_end_clean();
}
You might want to turn off output buffering altogether for that particular location, using PHP's output_buffering configuration directive.
Apache example:
<Directory "/your/downloadable/files">
...
php_admin_value output_buffering "0"
...
</Directory>
"Off" as the value seems to work as well, while it really should throw an error. At least according to how other types are converted to booleans in PHP. *shrugs*
I came up with this idea in the past (as part of my library) to avoid high memory usage:
function suTunnelStream( $sUrl, $sMimeType, $sCharType = null )
{
    // Note: the su*() helper functions are part of the author's own library.
    $f = @fopen( $sUrl, 'rb' );
    if( $f === false ) {
        return false;
    }

    $b = false;
    $u = true;

    while( $u !== false && !feof($f) )
    {
        $u = @fread( $f, 1024 );
        if( $u !== false )
        {
            if( !$b )
            {
                $b = true;
                suClearOutputBuffers();
                suCachedHeader( 0, $sMimeType, $sCharType, null,
                    !suIsValidString($sCharType) ? ('content-disposition: attachment; filename="'.suUniqueId($sUrl).'"') : null );
            }
            echo $u;
        }
    }

    @fclose( $f );
    return ( $b && $u !== false );
}
Maybe this can give you some inspiration.
Well, it is a memory-intensive function. I would point users to a static server that has a specific rule set in place to control downloads, instead of using readfile().
If that's not an option, add more RAM to satisfy the load, or introduce a queuing system that gracefully controls server usage.
