So, I'm writing a chunked file transfer script that is intended to copy files, small and large, to a remote server. It almost works fantastically (and did with a 26-byte file I tested, haha), but when I move on to larger files, I notice it isn't quite working. For example, I uploaded a 96,489,231-byte file, but the final file was 95,504,152 bytes. I tested it with a 928,670,754-byte file, and the copied file only had 927,902,792 bytes.
Has anyone else ever experienced this? I'm guessing feof() may be doing something wonky, but I have no idea how to replace it, or test that. I commented the code, for your convenience. :)
<?php
// FTP credentials
$server = CENSORED;
$username = CENSORED;
$password = CENSORED;

// Destination file (where the copied file should go)
$destination = "ftp://$username:$password@$server/ftp/final.mp4";

// The file on my server that we're copying (in chunks) to $destination.
$read = 'grr.mp4';

// If the file we're trying to copy exists...
if (file_exists($read))
{
    // Set a chunk size
    $chunk_size = 4194304;

    // For reading through the file we want to copy to the FTP server.
    $read_handle = fopen($read, 'rb');

    // For appending to the destination file.
    $destination_handle = fopen($destination, 'ab');

    echo '<span style="font-size:20px;">';
    echo 'Uploading.....';

    // Loop through $read until we reach the end of the file.
    while (!feof($read_handle))
    {
        // So Rackspace doesn't think nothing's happening.
        echo PHP_EOL;
        flush();

        // Read a chunk of the file we're copying.
        $chunk = fread($read_handle, $chunk_size);

        // Write the chunk to the destination file.
        fwrite($destination_handle, $chunk);
        sleep(1);
    }

    echo 'Done!';
    echo '</span>';
}
fclose($read_handle);
fclose($destination_handle);
?>
EDIT
I (may have) confirmed that the script is dying at the end somehow, and not corrupting the files. I created a simple file with each line corresponding to the line number, up to 10000, then ran my script. It stopped at line 6253. However, the script is still returning "Done!" at the end, so I can't imagine it's a timeout issue. Strange!
EDIT 2
I have confirmed that the problem exists somewhere in fwrite(). By echoing $chunk inside the loop, the complete file is returned without fail. However, the written file still does not match.
EDIT 3
It appears to work if I add sleep(1) immediately after the fwrite(). However, that makes the script take a million years to run. Is it possible that PHP's append has some inherent flaw?
EDIT 4
Alright, I've further isolated the problem to being an FTP problem, somehow. When I run this file copy locally, it works fine. However, when I use the FTP wrapper for the destination, the bytes are missing. This is occurring despite the binary flags in the two calls to fopen(). What could possibly be causing this?
EDIT 5
I found a fix. The modified code is above; I'll post an answer of my own as soon as I'm able.
I found a fix, though I'm not sure exactly why it works. Simply sleeping after writing each chunk fixes the problem. I upped the chunk size quite a bit to speed things up. Though this is an arguably bad solution, it should work for my uses. Thanks anyway, guys!
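Another thing worth trying, as a sketch only: fwrite() can perform partial writes on network streams, so checking its return value and looping until the whole chunk has gone out may account for the missing bytes without needing the sleep. Variable names below match the script above.

// Sketch: retry partial writes instead of sleeping after each chunk.
// Assumes $read_handle, $destination_handle and $chunk_size from the script above.
while (!feof($read_handle)) {
    $chunk = fread($read_handle, $chunk_size);
    $offset = 0;
    $length = strlen($chunk);
    while ($offset < $length) {
        // fwrite() may write fewer bytes than requested on a network stream.
        $written = fwrite($destination_handle, substr($chunk, $offset));
        if ($written === false || $written === 0) {
            die('Write to the FTP stream failed.');
        }
        $offset += $written;
    }
}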
Related
I have the following script that runs to read new content from a file:
<?php
clearstatcache();

// UNC path, single-quoted so the backslashes are not treated as escapes.
$fileURL = '\\\\saturn\\extern\\seq_ws.csv';
$fileAvailable = file_exists($fileURL);
$bytesRead = file_get_contents("bytes.txt");

if ($fileAvailable) {
    $fileSize = filesize($fileURL);
    // Statuses: 1 = partial read, 2 = complete read, 0 = no read, -1 = file not found. Followed by !!
    if ($bytesRead < $fileSize) {
        // $bytesRead till $fileSize bytes read from file.
        $content = file_get_contents($fileURL, NULL, NULL, $bytesRead);
        file_put_contents("bytes.txt", ((int)$bytesRead + strlen($content)));
        echo "1!!$content";
    } else if ($bytesRead > $fileSize) {
        // File edit or delete detected, whole file read again.
        $content = file_get_contents($fileURL);
        file_put_contents("bytes.txt", strlen($content));
        echo "2!!$content";
    } else if ($bytesRead == $fileSize) {
        // No new data found, no action taken.
        echo "0!!";
    }
} else {
    // File delete detected, reading whole file when available.
    echo "-1!!";
    file_put_contents("bytes.txt", "0");
}
?>
It works perfectly when I run it and does what is expected.
When I edit the file from the same PC as my server, it works instantly and returns the correct values.
However, when I edit the file from another PC, my script takes about 4-6 seconds to read the correct filesize.
I added clearstatcache(); at the top of my script because I think it's a caching issue. The strange thing is that when I change the file from the server PC it responds instantly, but from another PC it doesn't.
On top of that, as soon as the other PC changes the file, I can see the change in Windows (both filesize and content), but for some reason it takes Apache about 4-6 seconds to detect the change. In those 4-6 seconds it receives the old filesize from before the change.
So I have the following questions:
Is the filesize information cached anywhere, maybe either on the Apache server or inside Windows?
If question 1 applies, is there any way to remove or disable this caching?
Is it possible this isn't a caching problem?
I think that on your local PC, PHP has development settings.
So I suggest checking php.ini for this parameter: realpath_cache_ttl
Which is:
realpath_cache_ttl (integer): Duration of time (in seconds) for which to cache realpath information for a given file or directory. For systems with rarely changing files, consider increasing the value.
To test it, run phpinfo() both locally and on the server to check that value:
<?php phpinfo();
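If the stat/realpath cache does turn out to be the culprit, clearstatcache() can also be pointed at one specific file instead of clearing everything; a minimal sketch reusing $fileURL from the question (the second parameter requires PHP 5.3+):

// Sketch: clear the stat cache and the realpath cache for this one file
// right before asking for its size.
clearstatcache(true, $fileURL);
$fileSize = filesize($fileURL);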
I have a file on a website. A PHP script modifies it like this:
$contents = file_get_contents("MyFile");
// ** Modify $contents **
// Now rewrite:
$file = fopen("MyFile","w+");
fwrite($file, $contents);
fclose($file);
The modification is pretty simple. It grabs the file's contents and adds a few lines. Then it overwrites the file.
I am aware that PHP has a function for appending contents to a file rather than overwriting it all over again. However, I want to keep using this method since I'll probably change the modification algorithm in the future (so appending may not be enough).
Anyway, I was testing this out, making like 100 requests. Each time I call the script, I add a new line to the file:
First call:
First!
Second call:
First!
Second!
Third call:
First!
Second!
Third!
Pretty cool. But then:
Fourth call:
Fourth!
Fifth call:
Fourth!
Fifth!
As you can see, the first, second and third lines simply disappeared.
I've determined that the problem isn't the contents string modification algorithm (I've tested it separately). Something is messed up either when reading or writing the file.
I think it is very likely that the issue is when the file's contents are read: if $contents, for some odd reason, is empty, then the behavior shown above makes sense.
I'm no expert with PHP, but perhaps the fact that I performed 100 calls almost simultaneously caused this issue. What if there are two processes, and one is writing the file while the other is reading it?
What is the recommended approach for this issue? How should I manage file modifications when several processes could be writing/reading the same file?
What you need to do is use flock() (file lock).
What I think is happening is that your script is grabbing the file while a previous request is still writing to it. Since the previous request opened the file with "w+" (which truncates it) and hasn't finished writing yet, PHP reads an empty string, and once the later process is done it overwrites what was there before.
The solution is to have the script usleep() for a few milliseconds when the file is locked and then try again. Just be sure to put a limit on how many times your script can try.
NOTICE:
If another PHP script or application accesses the file, it may not necessarily use/check for file locks. This is because file locks are often seen as an optional extra, since in most cases they aren't needed.
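A minimal sketch of that retry idea (the file name, attempt limit, and sleep interval are just examples):

// Sketch: try to grab an exclusive lock without blocking, retrying a
// limited number of times before giving up.
$file = 'file.txt';
$maxAttempts = 50;

$fh = fopen($file, 'c+'); // open for read/write without truncating
if (!$fh) {
    die('Unable to open '.$file);
}

$attempts = 0;
while (!flock($fh, LOCK_EX | LOCK_NB)) {
    if (++$attempts >= $maxAttempts) {
        fclose($fh);
        die('Could not obtain a lock on '.$file);
    }
    usleep(10000); // wait 10 ms before trying again
}

// ... read/modify/write the file here ...

flock($fh, LOCK_UN);
fclose($fh);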
So the issue is parallel access to the same file: while one instance is writing to the file, another is reading it before it has been fully updated.
Luckily, PHP has a mechanism for locking the file so no one can read from it until the lock is released and the file has been updated.
flock() can be used, and the documentation is here.
You need to create a lock, so that any concurrent requests will have to wait their turn. This can be done using the flock() function. You will have to use fopen(), as opposed to file_get_contents(), but it should not be a problem:
$file = 'file.txt';
$fh = fopen($file, 'r+');
if (flock($fh, LOCK_EX)) {                // Get an exclusive lock
    $data = fread($fh, filesize($file));  // Get the contents of the file

    // Do something with $data here to produce $newData...

    ftruncate($fh, 0);       // Empty the file
    rewind($fh);             // Move the file pointer back to the start before writing
    fwrite($fh, $newData);   // Write new data to the file
    fclose($fh);             // Close the handle and release the lock
} else {
    fclose($fh);
    die('Unable to get a lock on file: '.$file);
}
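For the simple append-a-line case only (not the general read-modify-write above), a shorter variant is to let file_put_contents() take the lock for you; a minimal sketch:

// Sketch: append one line while holding an exclusive lock.
// Only covers appending, not a full read-modify-write.
file_put_contents('file.txt', "Another line\n", FILE_APPEND | LOCK_EX);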
My app reads a large file (5 MB - 10 MB) that contains JSON entries, one per line.
Each line is fed to multiple parsers and handled separately. Once the file is read, it is moved. The program is continuously fed files to process.
The program currently works with file_get_contents($filename). The program's structure works as is.
The problem is that file_get_contents loads the entire file into memory, and the whole run takes about a minute. I suspect I can gain speed if I read it line by line rather than waiting for the whole file to load into memory (I might be wrong, and I'm open to suggestions).
There are many file-handling functions that could do this. What is the most effective way to achieve what I need, and which file-reading method is best for this?
I have fopen, fread, readfile, file, and fscanf to contend with off the top of my head. However, when I read the manual pages for them, it's all code for reading generic files, without a clear indication of what is best for larger files.
$file = fopen("file.json", "r");
if ($file)
{
while (($line = fgets($file)) !== false)
{
echo $line;
}
}
else
{
echo "Unable to open the file";
}
fgets() reads until it reaches EOL or EOF. If you want, you can limit how much it reads using the second argument.
For more info about fgets: http://us3.php.net/fgets
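Since the entries are JSON, one per line, a small generator keeps memory usage flat while still handing the existing parsers one decoded entry at a time; a sketch under that assumption (readJsonLines is a made-up helper name):

// Sketch: yield one decoded JSON entry per line instead of loading the
// whole file into memory.
function readJsonLines($filename)
{
    $handle = fopen($filename, 'r');
    if (!$handle) {
        return;
    }
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        yield json_decode($line, true);
    }
    fclose($handle);
}

foreach (readJsonLines('file.json') as $entry) {
    // hand $entry to the parsers here
}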
I am trying to build a small daemon in PHP that analyzes the log files on a Linux system (e.g. follows the syslog).
I have managed to open the file via fopen and continuously read it with stream_get_line. My problem starts when the monitored file is deleted and recreated (e.g. when rotating logs). The program then does not read anything anymore, even if the file grows larger than before.
Is there an elegant solution for this? stream_get_meta_data does not help, and using tail -f on the command line shows the same problem.
EDIT, added sample code
I tried to boil the code down to a minimum to illustrate what I am looking for:
<?php
$break = FALSE;
$handle = fopen('./testlog.txt', 'r');
do {
    $line = stream_get_line($handle, 100, "\n");
    if (!empty($line)) {
        // do something
        echo $line;
    }
    while (feof($handle)) {
        sleep(5);
        $line = stream_get_line($handle, 100, "\n");
        if (!empty($line)) {
            // do something
            echo $line;
        }
        // a comment on php.net indicated it is possible
        // with tcp streams to distinguish empty and lost
        // does NOT work here --> need somefunction($handle)
        if ($line !== FALSE && $line === '') $break = TRUE;
    }
} while (!$break);
fclose($handle);
?>
When log files are rotated, the original file is copied, then deleted, and a new file with the same name is created. The new file may have the same name as the original, but it has a different inode. Inodes (dumbed-down description follows) are like hidden index numbers for your files. You can change the name of a file, or move it, and it takes its inode with it. Once that original log file is deleted, you can't keep reading the file with the same name through the same file handle, because the inode has changed. Your best bet is to detect the failure and attempt to open the new file.
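A rough sketch of that detect-and-reopen idea, comparing the inode of the open handle with the inode currently behind the path (the file name and sleep interval are just examples):

// Sketch: detect log rotation by comparing inodes, then reopen the file.
$path   = './testlog.txt';
$handle = fopen($path, 'r');

while (true) {
    $line = stream_get_line($handle, 4096, "\n");
    if ($line !== false && $line !== '') {
        echo $line, PHP_EOL; // do something with the line
        continue;
    }

    sleep(5);

    // If the file on disk now has a different inode (or is gone),
    // the log was rotated: close the old handle and reopen the new file.
    $current = @stat($path);
    $open    = fstat($handle);
    if ($current === false || $current['ino'] !== $open['ino']) {
        fclose($handle);
        while (($handle = @fopen($path, 'r')) === false) {
            sleep(5); // wait for the new file to appear
        }
    }
}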
This first script gets called several times for each user via an AJAX request. It calls another script on a different server to get the last line of a text file. It works fine, but I think there is a lot of room for improvement. I am not a very good PHP coder, so I am hoping that with the help of the community I can optimize this for speed and efficiency:
AJAX POST Request made to this script
<?php session_start();
$fileName = $_POST['textFile'];
$result = file_get_contents($_SESSION['serverURL']."fileReader.php?textFile=$fileName");
echo $result;
?>
It makes a GET request to this external script which reads a text file
<?php
$fileName = $_GET['textFile'];
if (file_exists('text/'.$fileName.'.txt')) {
    $lines = file('text/'.$fileName.'.txt');
    echo $lines[sizeof($lines)-1];
}
else {
    echo 0;
}
?>
I would appreciate any help. I think there is more improvement that can be made in the first script. It makes an expensive function call (file_get_contents); well, at least I think it's expensive!
This script should limit the locations and file types that it's going to return.
Think of somebody trying this:
http://www.yoursite.com/yourscript.php?textFile=../../../etc/passwd (or something similar)
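A minimal sketch of locking the parameter down (the whitelist regex and the text/ directory are just examples):

// Sketch: only accept simple names, and only serve .txt files from text/.
$fileName = $_GET['textFile'];
if (!preg_match('/^[A-Za-z0-9_-]+$/', $fileName)) {
    echo 0;
    exit;
}
$path = 'text/'.$fileName.'.txt'; // safe to pass to file() / readfile() now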
Try to find out where the delays occur: does the HTTP request take long, or is the file so large that reading it takes long?
If the request is slow, try caching results locally.
If the file is huge, then you could set up a cron job that extracts the last line of the file at regular intervals (or at every change), and save that to a file that your other script can access directly.
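If the cross-server request turns out to be the slow part, a crude local cache with a short TTL would keep most AJAX hits off the remote server; a sketch (the cache location and TTL are arbitrary, and it reuses $fileName and $_SESSION['serverURL'] from the question):

// Sketch: cache the remote result for a few seconds per file.
$cacheFile = sys_get_temp_dir().'/lastline_'.md5($fileName).'.cache';
$ttl = 5; // seconds

if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    $result = file_get_contents($cacheFile);
} else {
    $result = file_get_contents($_SESSION['serverURL']."fileReader.php?textFile=$fileName");
    file_put_contents($cacheFile, $result);
}
echo $result;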
readfile() is your friend here: it reads a file on disk and streams it to the client.
script 1:
<?php
session_start();
// added basic argument filtering
$fileName = preg_replace('/[^A-Za-z0-9_]/', '', $_POST['textFile']);
$fileName = $_SESSION['serverURL'].'text/'.$fileName.'.txt';

if (file_exists($fileName)) {
    // script 2 could be pasted here

    // for the entire file
    //readfile($fileName);

    // for just the last line
    $lines = file($fileName);
    echo $lines[count($lines)-1];

    exit(0);
}

echo 0;
?>
This script could be further improved by adding caching to it, but that is more complicated. The most basic caching could be:
script 2:
<?php
$lastModifiedTimeStamp = filemtime($fileName);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
    $browserCachedCopyTimestamp = strtotime(preg_replace('/;.*$/', '', $_SERVER['HTTP_IF_MODIFIED_SINCE']));
    if ($browserCachedCopyTimestamp >= $lastModifiedTimeStamp) {
        header("HTTP/1.0 304 Not Modified");
        exit(0);
    }
}

header('Content-Length: '.filesize($fileName));
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 604800)); // (3600 * 24 * 7)
header('Last-Modified: '.gmdate('D, d M Y H:i:s \G\M\T', $lastModifiedTimeStamp));
?>
First things first: Do you really need to optimize that? Is that the slowest part in your use case? Have you used xdebug to verify that? If you've done that, read on:
You cannot really optimize the first script usefully: if you need an HTTP request, you need an HTTP request. Skipping the HTTP request could be a performance gain, though, if it is possible (i.e. if the first script can access the same files the second script would operate on).
As for the second script: reading the whole file into memory does look like some overhead, but that is negligible if the files are small. The code looks very readable; I would leave it as is in that case.
If your files are big, however, you might want to use fopen() and its friends fseek() and fread():
# Do not forget to sanitize the file name here!
# An attacker could demand the last line of your password
# file or similar! ($fileName = '../../passwords.txt')
$filePointer = fopen($fileName, 'r');
$i = 1;
$chunkSize = 200;

# Read 200 byte chunks from the end of the file and check if the chunk
# contains a newline
do {
    fseek($filePointer, -($i * $chunkSize), SEEK_END);
    $line = fread($filePointer, $i++ * $chunkSize);
} while (($pos = strrpos($line, "\n")) === false);

return substr($line, $pos + 1);
If the files are unchanging, you should cache the last line.
If the files are changing and you control the way they are produced, it might or might not be an improvement to reverse the order lines are written, depending on how often a line is read over its lifetime.
Edit:
Your server could figure out what it wants to write to its log, put it in memcache, and then write it to the log. The request for the last line could then be fulfilled from memcache instead of a file read.
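A sketch of that memcache idea using the Memcached extension (the key name and server address are made up):

// Sketch: the writer stores the last line under a known key as it logs;
// readers check memcache first and only fall back to the file on a miss.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Writer side (wherever the log line is produced):
$mc->set('lastline:'.$fileName, $logLine);

// Reader side (instead of reading the file):
$line = $mc->get('lastline:'.$fileName);
if ($line === false) {
    // cache miss: fall back to reading the last line from the file
}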
The most probable source of delay is that cross-server HTTP request. If the files are small, the cost of fopen/fread/fclose is nothing compared to the whole HTTP request.
(Not long ago I used HTTP to retrieve images to dynamically generate image-based menus. Replacing the HTTP request with a local file read reduced the delay from seconds to tenths of a second.)
I assume that the obvious solution of accessing the file server filesystem directly is out of the question. If not, then it's the best and simplest option.
If not, you could use caching. Instead of getting the whole file, you just issue a HEAD request and compare the timestamp to a local copy.
Also, if you are Ajax-updating a lot of clients based on the same files, you might consider looking at using comet (meteor, for example). It's used for things like chats, where a single change has to be broadcast to several clients.