The problem
How to write data to start of file if I have not enough space to allocate it in RAM and I have not enough space to make it's copy on current FS partition? I.e. I have a file with 100Mb size, I have 30Mb memory limit in my PHP script (and it can not be adjusted in any way) and I have only 50Mb free on my current FS partition. I want to add 2-10 rows to file (it's definitely less than remaining 50Mb FS space)
Some background
I know about XY-problem and agree that it's true for this case. But to reconsider this case I'll need to change significant part of current application (actually, it went from previous team) and, may be, API of other applications that using this file.
My attempt
I have not found solution for this yet. My previous approach was - to use some network buffer (i.e. to connect to some external storage, such as MySQL, for example - it's located on another machine where there is enough space to write file's copy)
The question
So, is it possible to write data to file's start when I have not enough space to allocate it in RAM and have not enough space to create file's copy on FS? Is using network (external) storage the only solution?
Say you want to write 2K to the beginning of a file, your only real option is to:
open the file
read as much from the end of the file as you can fit into memory
write it back into the file 2K later than you started to read
continue with the previous block of data until you have shifted the entire content of the file 2K towards the end
write your 2K to the beginning
To visualize that:
|------------------------|
|-----------------XXXXXXX|
------>
|-------------------XXXXXXX|
|----------XXXXXXX---------|
------>
|------------XXXXXXX-------|
...repeat...
Note that this is a very unsafe operation which edits the file in place. If the process crashes, you're left with a file in an inconsistent state. If you don't have enough room on disk to duplicate a file you arguably shouldn't work with that file and expand your storage capacity first.
#deceze hint me great idea. So I've finished with:
function reverseFile($sIn, $sOut, $bRemoveSource=false)
{
$rFile = #fopen($sIn, 'a+');
$rTemp = #fopen($sOut,'a+');
if(!$rFile || !$rTemp)
{
return false;
}
$iPos = filesize($sIn)-1;
while($iPos>=0)
{
fseek($rFile, $iPos, SEEK_SET);
fwrite($rTemp, $tmp=fread($rFile, 1));
ftruncate($rFile, $iPos>0?$iPos:0);
clearstatcache();
$iPos--;
}
fclose($rFile);
fclose($rTemp);
if($bRemoveSource)
{
unlink($sIn);
}
return true;
}
function writeReverse($sFile, $sData, $sTemp=null)
{
if(!isset($sTemp))
{
$sTemp=$sFile.'.rev';
}
if(reverseFile($sFile, $sTemp, 1))
{
file_put_contents($sTemp, strrev($sData), FILE_APPEND);
return reverseFile($sTemp, $sFile, 1);
}
return false;
}
-it will be quite slow, but recoverable if process is interrupted (simply look to .rev file)
Thanks to all who participated in this.
I've tried code suggested by #AlmaDo, don't try it on real projects, or you will be burn in hell, it is VERY slow. (60MB file - processing 19minutes)
You can run shell script - https://stackoverflow.com/a/9533736/2064576 (processed 420ms, can not understand how much memory does it use)
Or try this php script - https://stackoverflow.com/a/16813550/2064576 (160ms, worked with memory_limit=3M, not worked with 2M)
Related
I have a process done in PHP. This process get a file from internet and put it inside a zip file. The target zipfile is based in an algorithm, there are 4096 zipfiles. The target zipfile is based in a hash of the url processed.
I have another program that launches http petitions so i can run the script concurrently (around 110 processes).
My question is simple. Since threads are pseudorandom, easly 2 threads can try to add files to the same zipfile in the same moment.
Is it possible? Will the file get corrupt if 2 proccess try to add files at same time?
Locking the file or something like that would be possible a possible solution.
I was thinking to use semaphores, but reading, php semaphores dont work under windows.
I have seen this possible solution:
if ( !function_exists('sem_get') ) {
function sem_get($key) { return fopen(__FILE__.'.sem.'.$key, 'w+'); }
function sem_acquire($sem_id) { return flock($sem_id, LOCK_EX); }
function sem_release($sem_id) { return flock($sem_id, LOCK_UN); }
}
Anyways the question is if it is allowed to add files to a zip file from 2 or more different php proccesses at same time.
Short answer: No! The zip algorithm analyses and compresses one stream at a time.
This is tough under Windows. It's far from easy in Linux! I would be tempted to create a db table with a unique index, and use that index number to determine a filename, or at least flag that a file is being written to.
So, I'm writing a chunked file transfer script that is intended to copy files--small and large--to a remote server. It almost works fantastically (and did with a 26 byte file I tested, haha) but when I start to do larger files, I notice it isn't quite working. For example, I uploaded a 96,489,231 byte file, but the final file was 95,504,152 bytes. I tested it with a 928,670,754 byte file, and the copied file only had 927,902,792 bytes.
Has anyone else ever experienced this? I'm guessing feof() may be doing something wonky, but I have no idea how to replace it, or test that. I commented the code, for your convenience. :)
<?php
// FTP credentials
$server = CENSORED;
$username = CENSORED;
$password = CENSORED;
// Destination file (where the copied file should go)
$destination = "ftp://$username:$password#$server/ftp/final.mp4";
// The file on my server that we're copying (in chunks) to $destination.
$read = 'grr.mp4';
// If the file we're trying to copy exists...
if (file_exists($read))
{
// Set a chunk size
$chunk_size = 4194304;
// For reading through the file we want to copy to the FTP server.
$read_handle = fopen($read, 'rb');
// For appending to the destination file.
$destination_handle = fopen($destination, 'ab');
echo '<span style="font-size:20px;">';
echo 'Uploading.....';
// Loop through $read until we reach the end of the file.
while (!feof($read_handle))
{
// So Rackspace doesn't think nothing's happening.
echo PHP_EOL;
flush();
// Read a chunk of the file we're copying.
$chunk = fread($read_handle, $chunk_size);
// Write the chunk to the destination file.
fwrite($destination_handle, $chunk);
sleep(1);
}
echo 'Done!';
echo '</span>';
}
fclose($read_handle);
fclose($destination_handle);
?>
EDIT
I (may have) confirmed that the script is dying at the end somehow, and not corrupting the files. I created a simple file with each line corresponding to the line number, up to 10000, then ran my script. It stopped at line 6253. However, the script is still returning "Done!" at the end, so I can't imagine it's a timeout issue. Strange!
EDIT 2
I have confirmed that the problem exists somewhere in fwrite(). By echoing $chunk inside the loop, the complete file is returned without fail. However, the written file still does not match.
EDIT 3
It appears to work if I add sleep(1) immediately after the fwrite(). However, that makes the script take a million years to run. Is it possible that PHP's append has some inherent flaw?
EDIT 4
Alright, further isolated the problem to being an FTP problem, somehow. When I run this file copy locally, it works fine. However, when I use the file transfer protocol (line 9) the bytes are missing. This is occurring despite the binary flags the two cases of fopen(). What could possibly be causing this?
EDIT 5
I found a fix. The modified code is above--I'll post an answer on my own as soon as I'm able.
I found a fix, though I'm not sure exactly why it works. Simply sleeping after writing each chunk fixes the problem. I upped the chunk size quite a bit to speed things up. Though this is an arguably bad solution, it should work for my uses. Thanks anyway, guys!
For a project I was working on I need a queue which will be too large to hold in normal memory. I had been implementing it as a simple file where it would read the whole file take the first few (~100) lines, process them, then write back the updated queue with new instructions added and the old ones removed. However, since the queue became too large to hold in memory like this I need something different. Preferably someone can tell me a way to peel off just the first few lines of a file without having to look at the rest of the data. I had thought about using a database (MySQL probably with sorted insert timestamps) but I would heavily prefer to do it without for load and bandwidth reasons (several servers would have to all be sending and receiving a lot of data from the DB). The language I'm working in is PHP but really this question is more about unix files I suppose. Any help would be appreciated.
Sucking out the first line of a file is pretty trivial (fopen() followed by an fgets()). Re-writing the file to remove completed jobs would be very painful, especially if you've got multiple concurrent servers working off the same queue file.
One alternative would be to use a seperate file for each job. If you have some concurrency-safe method of generating an incrementing ID for these files, then it'd be a simple matter of picking out the file with the lowest id for the oldest job, and generating a new id for each new job. You'd have to figure out some file locking, though, to keep two+ servers grabbing the same file at the same time, however.
I had same problems while I was working on enqueue/fs transport. I failed to modify a small portion at the begging of the file without copying it to the memory and saving back. Instead, but that's possible to do that with the end of the file. You can read a portion and then truncate it. That's not really a queue but a stack. So if you rely on message ordering this would not be a solution. In my case, I lock the file when the file has been read from the file, the lock is released.
This is how you could write messages to a queue file:
<?php
$rawMessage = 'this your message to put to the queue as a string';
$queueFile = fopen('/path/to/queue/file', '+a');
// here it may add some spaces so the message length is multiples of modular.
// that make it easier to read messages from a file.
// lock file
$rawMessage = str_repeat(' ', 64 - (strlen($rawMessage) % 64)).$rawMessage;
fwrite($queueFile, $rawMessage);
// release lock
This is how you could read messages from a queue file:
<?php
$queueFile = fopen('/path/to/queue/file', '+c');
// lock file
$frame = readFrame($file, 1);
ftruncate($file, fstat($file)['size'] - strlen($frame));
rewind($file);
$rawMessage = substr(trim($frame), 1);
// release lock
function readFrame($file, $frameNumber)
{
$frameSize = 64;
$offset = $frameNumber * $frameSize;
fseek($file, -$offset, SEEK_END);
$frame = fread($file, $frameSize);
if ('' == $frame) {
return '';
}
if (false !== strpos($frame, '|{')) {
return $frame;
}
return readFrame($file, $frameNumber + 1).$frame;
}
For the locking I'd suggest using Symfony LockHandler or simply take enqueue/fs.
What's the cleanest way in php to open a file, read the contents, and subsequently overwrite the file's contents with some output based on the original contents? Specifically, I'm trying to open a file populated with a list of items (separated by newlines), process/add items to the list, remove the oldest N entries from the list, and finally write the list back into the file.
fopen(<path>, 'a+')
flock(<handle>, LOCK_EX)
fread(<handle>, filesize(<path>))
// process contents and remove old entries
fwrite(<handle>, <contents>)
flock(<handle>, LOCK_UN)
fclose(<handle>)
Note that I need to lock the file with flock() in order to protect it across multiple page requests. Will the 'w+' flag when fopen()ing do the trick? The php manual states that it will truncate the file to zero length, so it seems that may prevent me from reading the file's current contents.
If the file isn't overly large (that is, you can be confident loading it won't blow PHP's memory limit), then the easiest way to go is to just read the entire file into a string (file_get_contents()), process the string, and write the result back to the file (file_put_contents()). This approach has two problems:
If the file is too large (say, tens or hundreds of megabytes), or the processing is memory-hungry, you're going to run out of memory (even more so when you have multiple instances of the thing running).
The operation is destructive; when the saving fails halfway through, you lose all your original data.
If any of these is a concern, plan B is to process the file and at the same time write to a temporary file; after successful completion, close both files, rename (or delete) the original file and then rename the temporary file to the original filename.
Read
$data = file_get_contents($filename);
Write
file_put_contents($filename, $data);
One solution is to use a separate lock file to control access.
This solution assumes that only your script, or scripts you have access to, will want to write to the file. This is because the scripts will need to know to check a separate file for access.
$file_lock = obtain_file_lock();
if ($file_lock) {
$old_information = file_get_contents('/path/to/main/file');
$new_information = update_information_somehow($old_information);
file_put_contents('/path/to/main/file', $new_information);
release_file_lock($file_lock);
}
function obtain_file_lock() {
$attempts = 10;
// There are probably better ways of dealing with waiting for a file
// lock but this shows the principle of dealing with the original
// question.
for ($ii = 0; $ii < $attempts; $ii++) {
$lock_file = fopen('/path/to/lock/file', 'r'); //only need read access
if (flock($lock_file, LOCK_EX)) {
return $lock_file;
} else {
//give time for other process to release lock
usleep(100000); //0.1 seconds
}
}
//This is only reached if all attempts fail.
//Error code here for dealing with that eventuality.
}
function release_file_lock($lock_file) {
flock($lock_file, LOCK_UN);
fclose($lock_file);
}
This should prevent a concurrently-running script reading old information and updating that, causing you to lose information that another script has updated after you read the file. It will allow only one instance of the script to read the file and then overwrite it with updated information.
While this hopefully answers the original question, it doesn't give a good solution to making sure all concurrent scripts have the ability to record their information eventually.
This first script gets called several times for each user via an AJAX request. It calls another script on a different server to get the last line of a text file. It works fine, but I think there is a lot of room for improvement but I am not a very good PHP coder, so I am hoping with the help of the community I can optimize this for speed and efficiency:
AJAX POST Request made to this script
<?php session_start();
$fileName = $_POST['textFile'];
$result = file_get_contents($_SESSION['serverURL']."fileReader.php?textFile=$fileName");
echo $result;
?>
It makes a GET request to this external script which reads a text file
<?php
$fileName = $_GET['textFile'];
if (file_exists('text/'.$fileName.'.txt')) {
$lines = file('text/'.$fileName.'.txt');
echo $lines[sizeof($lines)-1];
}
else{
echo 0;
}
?>
I would appreciate any help. I think there is more improvement that can be made in the first script. It makes an expensive function call (file_get_contents), well at least I think its expensive!
This script should limit the locations and file types that it's going to return.
Think of somebody trying this:
http://www.yoursite.com/yourscript.php?textFile=../../../etc/passwd (or something similar)
Try to find out where delays occur.. does the HTTP request take long, or is the file so large that reading it takes long.
If the request is slow, try caching results locally.
If the file is huge, then you could set up a cron job that extracts the last line of the file at regular intervals (or at every change), and save that to a file that your other script can access directly.
readfile is your friend here
it reads a file on disk and streams it to the client.
script 1:
<?php
session_start();
// added basic argument filtering
$fileName = preg_replace('/[^A-Za-z0-9_]/', '', $_POST['textFile']);
$fileName = $_SESSION['serverURL'].'text/'.$fileName.'.txt';
if (file_exists($fileName)) {
// script 2 could be pasted here
//for the entire file
//readfile($fileName);
//for just the last line
$lines = file($fileName);
echo $lines[count($lines)-1];
exit(0);
}
echo 0;
?>
This script could further be improved by adding caching to it. But that is more complicated.
The very basic caching could be.
script 2:
<?php
$lastModifiedTimeStamp filemtime($fileName);
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
$browserCachedCopyTimestamp = strtotime(preg_replace('/;.*$/', '', $_SERVER['HTTP_IF_MODIFIED_SINCE']));
if ($browserCachedCopyTimestamp >= $lastModifiedTimeStamp) {
header("HTTP/1.0 304 Not Modified");
exit(0);
}
}
header('Content-Length: '.filesize($fileName));
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 604800)); // (3600 * 24 * 7)
header('Last-Modified: '.date('D, d M Y H:i:s \G\M\T', $lastModifiedTimeStamp));
?>
First things first: Do you really need to optimize that? Is that the slowest part in your use case? Have you used xdebug to verify that? If you've done that, read on:
You cannot really optimize the first script usefully: If you need a http-request, you need a http-request. Skipping the http request could be a performance gain, though, if it is possible (i.e. if the first script can access the same files the second script would operate on).
As for the second script: Reading the whole file into memory does look like some overhead, but that is neglibable, if the files are small. The code looks very readable, I would leave it as is in that case.
If your files are big, however, you might want to use fopen() and its friends fseek() and fread()
# Do not forget to sanitize the file name here!
# An attacker could demand the last line of your password
# file or similar! ($fileName = '../../passwords.txt')
$filePointer = fopen($fileName, 'r');
$i = 1;
$chunkSize = 200;
# Read 200 byte chunks from the file and check if the chunk
# contains a newline
do {
fseek($filePointer, -($i * $chunkSize), SEEK_END);
$line = fread($filePointer, $i++ * $chunkSize);
} while (($pos = strrpos($line, "\n")) === false);
return substr($line, $pos + 1);
If the files are unchanging, you should cache the last line.
If the files are changing and you control the way they are produced, it might or might not be an improvement to reverse the order lines are written, depending on how often a line is read over its lifetime.
Edit:
Your server could figure out what it wants to write to its log, put it in memcache, and then write it to the log. The request for the last line could be fulfulled from memcache instead of file read.
The most probable source of delay is that cross-server HTTP request. If the files are small, the cost of fopen/fread/fclose is nothing compared to the whole HTTP request.
(Not long ago I used HTTP to retrieve images to dinamically generate image-based menus. Replacing the HTTP request by a local file read reduced the delay from seconds to tenths of a second.)
I assume that the obvious solution of accessing the file server filesystem directly is out of the question. If not, then it's the best and simplest option.
If not, you could use caching. Instead of getting the whole file, you just issue a HEAD request and compare the timestamp to a local copy.
Also, if you are ajax-updating a lot of clients based on the same files, you might consider looking at using comet (meteor, for example). It's used for things like chats, where a single change has to be broadcasted to several clients.