I have a process written in PHP. This process fetches a file from the internet and puts it inside a zip file. Which of the 4096 target zip files is used is determined by an algorithm: a hash of the URL being processed.
I have another program that launches HTTP requests, so I can run the script concurrently (around 110 processes).
My question is simple. Since the threads are pseudorandom, two threads can easily try to add files to the same zip file at the same moment.
Is that possible? Will the file get corrupted if two processes try to add files at the same time?
Locking the file, or something like that, would be a possible solution.
I was thinking of using semaphores but, from what I've read, PHP semaphores don't work under Windows.
I have seen this possible solution:
if (!function_exists('sem_get')) {
    function sem_get($key) { return fopen(__FILE__.'.sem.'.$key, 'w+'); }
    function sem_acquire($sem_id) { return flock($sem_id, LOCK_EX); }
    function sem_release($sem_id) { return flock($sem_id, LOCK_UN); }
}
Anyway, the question is whether it is allowed to add files to a zip file from two or more different PHP processes at the same time.
Short answer: No! The zip algorithm analyses and compresses one stream at a time.
This is tough under Windows. It's far from easy in Linux! I would be tempted to create a db table with a unique index, and use that index number to determine a filename, or at least flag that a file is being written to.
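If a database isn't an option, a possible alternative is to serialise access to each of the 4096 zip files with a per-zip lock file and flock(), which also works on Windows. This is only a minimal, untested sketch, not the poster's code:

function addUrlToZip($zipPath, $localFile, $entryName)
{
    // Hypothetical helper: take an exclusive lock on a lock file that sits
    // next to the zip, so only one process updates that zip at a time.
    $lock = fopen($zipPath . '.lock', 'w+');
    if ($lock === false) {
        return false;
    }
    flock($lock, LOCK_EX); // blocks until no other process holds the lock

    $zip = new ZipArchive();
    $ok  = $zip->open($zipPath, ZipArchive::CREATE) === true
        && $zip->addFile($localFile, $entryName)
        && $zip->close();

    flock($lock, LOCK_UN);
    fclose($lock);
    return $ok;
}

Each of the ~110 processes would call something like this instead of touching the zip directly; the hash of the URL would decide $zipPath exactly as before.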
Related
In my PHP project, I use a kind of counter that appends to an existing (or new) file very often:
$f = fopen($filename, 'ab');
fwrite($f, $data); // whatever is being appended
fclose($f);
When a new file is created, I have to edit this file's permissions, so another user may access the file as well:
$existed = file_exists($filename);
// Do the append
$f = fopen($filename, 'ab');
fwrite($f, $data); // whatever is being appended
fclose($f);
// Update permissions
if (!$existed) {
    @chmod($filename, 0666);
}
Is there any way to find out whether 'a' (append) created a new file or appended to an existing one, without using file_exists()? To my understanding, file_exists() retrieves the file stats, which causes some unnecessary overhead compared to a simple file append. As the function is used very often, I wonder if there's a way to tell whether fopen(..., 'a') created a new file without calling file_exists().
Note: This is mostly a question of style and interest, not a true performance issue. But if I am mistaken and fopen() already retrieves the file stats, please let me know!
Update
Okay, it really is a rather academic question. Here are some performance tests run on a Windows system (Apache, Win 8.1 - no UNIX file permissions) and a Linux machine (Nginx, Ubuntu 14.04, virtual machine).
Each test was run with 1000 repetitions; the file was deleted before the first repetition.
                                      Win      Linux
simply append one byte                1.8 ms    9.4 ms
append + clearstatcache()             1.8 ms    9.3 ms
test file_exists() + append           2.2 ms   10.5 ms
file_exists() + append + clear        2.2 ms   11.0 ms
append + chmod()                      2.7 ms   12.3 ms
append + file_exists() -> chmod()     3.3 ms   10.6 ms
Note: The last one is the only one that uses an IF within the test loop.
The PHP fopen() is just a call to the libc fopen(), which automatically creates the file for the modes w, w+, a and a+. As far as I can see, there is no way to get the stat with the permission bits from the returned file pointer.
It seems that PHP stores the stat array for each opened file, and you can access it with fstat($fp) on the opened file handle $fp. But the mode field contains the inode permission bits, and I can't immediately see how "inode permission bits" relate to the "UNIX file mode"; the stat system call does not use this term.
You can use "r+" mode to open your file and create it if that fails. If not you need to SEEK to then end to achieve something similar.
But finally it's best to check for existence before you open the file.
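A rough sketch of that r+-then-create idea (assuming $data holds the bytes being appended; note that it loses the atomic-append guarantee of the 'a' mode and has a small race window between the two fopen() calls):

$f = @fopen($filename, 'r+b'); // fails if the file does not exist yet
$created = false;

if ($f === false) {
    $f = fopen($filename, 'ab'); // create it
    $created = true;
} else {
    fseek($f, 0, SEEK_END);      // emulate append mode
}

fwrite($f, $data);
fclose($f);

if ($created) {
    @chmod($filename, 0666);
}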
No, fopen() just returns the resource; it doesn't return or set a flag that indicates whether the file already existed - http://php.net/manual/en/function.fopen.php
EDIT: see the performance test in the edited question.
Why not call chmod() every time?
Your file_exists() is probably (maybe run a little performance test...) more expensive than a chmod().
// Do the append
$f = fopen($filename, 'ab');
fwrite($f, $data); // whatever is being appended
fclose($f);
// Update permissions
@chmod($filename, 0666);
I have a file on a website. A PHP script modifies it like this:
$contents = file_get_contents("MyFile");
// ** Modify $contents **
// Now rewrite:
$file = fopen("MyFile","w+");
fwrite($file, $contents);
fclose($file);
The modification is pretty simple. It grabs the file's contents and adds a few lines. Then it overwrites the file.
I am aware that PHP has a function for appending contents to a file rather than overwriting it all over again. However, I want to keep using this method since I'll probably change the modification algorithm in the future (so appending may not be enough).
Anyway, I was testing this out, making like 100 requests. Each time I call the script, I add a new line to the file:
First call:
First!
Second call:
First!
Second!
Third call:
First!
Second!
Third!
Pretty cool. But then:
Fourth call:
Fourth!
Fifth call:
Fourth!
Fifth!
As you can see, the first, second and third lines simply disappeared.
I've determined that the problem isn't the contents string modification algorithm (I've tested it separately). Something is messed up either when reading or writing the file.
I think it is very likely that the issue is when the file's contents are read: if $contents, for some odd reason, is empty, then the behavior shown above makes sense.
I'm no expert with PHP, but perhaps the fact that I performed 100 calls almost simultaneously caused this issue. What if there are two processes, and one is writing the file while the other is reading it?
What is the recommended approach for this issue? How should I manage file modifications when several processes could be writing/reading the same file?
What you need to do is use flock() (file lock)
What I think is happening is that your script is grabbing the file while the previous script is still writing to it. Since the file is still being written to, its contents don't exist at the moment PHP grabs it, so PHP gets an empty string, and once the later process is done it overwrites the previous file.
The solution is to have the script usleep() for a few milliseconds when the file is locked and then try again. Just be sure to put a limit on how many times your script can try.
NOTICE:
If another PHP script or application accesses the file, it may not necessarily use/check for file locks. This is because file locks are often seen as an optional extra, since in most cases they aren't needed.
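A minimal sketch of that retry idea, using a non-blocking lock (LOCK_NB) with a bounded number of attempts ('c+' opens the file for reading and writing without truncating it, creating it if needed):

$fh = fopen('MyFile', 'c+');
$attempts = 0;

while (!flock($fh, LOCK_EX | LOCK_NB)) {
    if (++$attempts > 50) {
        die('Could not get a lock on MyFile');
    }
    usleep(10000); // wait 10 ms before trying again
}

$contents = stream_get_contents($fh);
// ** Modify $contents **
ftruncate($fh, 0);
rewind($fh);
fwrite($fh, $contents);

flock($fh, LOCK_UN);
fclose($fh);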
So the issue is parallel access to the same file: while one instance is writing to the file, another is reading it before the file has been updated.
Luckily, PHP has a mechanism for locking the file so no one can read from it until the lock is released and the file has been updated.
flock()
can be used; the documentation is here: http://php.net/manual/en/function.flock.php
You need to create a lock, so that any concurrent requests will have to wait their turn. This can be done using the flock() function. You will have to use fopen(), as opposed to file_get_contents(), but it should not be a problem:
$file = 'file.txt';
$fh = fopen($file, 'r+');

if (flock($fh, LOCK_EX)) {               // Get an exclusive lock
    $data = fread($fh, filesize($file)); // Get the contents of the file
    // Do something with the data here...
    ftruncate($fh, 0);                   // Empty the file
    rewind($fh);                         // Move the pointer back to the start before writing
    fwrite($fh, $newData);               // Write the new data to the file
    fclose($fh);                         // Close the handle and release the lock
} else {
    die('Unable to get a lock on file: '.$file);
}
The problem
How can I write data to the start of a file if I don't have enough space to allocate it in RAM and don't have enough space to make a copy of it on the current FS partition? I.e. I have a file of 100 MB, I have a 30 MB memory limit in my PHP script (and it cannot be adjusted in any way), and I have only 50 MB free on my current FS partition. I want to add 2-10 rows to the file (definitely less than the remaining 50 MB of FS space).
Some background
I know about the XY problem and agree that it applies to this case. But reconsidering it would require changing a significant part of the current application (which actually came from a previous team) and, maybe, the API of other applications that use this file.
My attempt
I have not found a solution for this yet. My previous approach was to use some network buffer (i.e. to connect to some external storage, such as MySQL, located on another machine where there is enough space to write a copy of the file).
The question
So, is it possible to write data to the start of a file when I don't have enough space to allocate it in RAM and don't have enough space to create a copy of the file on the FS? Is using network (external) storage the only solution?
Say you want to write 2K to the beginning of a file; your only real option is to:
open the file
read as much from the end of the file as you can fit into memory
write it back into the file 2K later than you started to read
continue with the previous block of data until you have shifted the entire content of the file 2K towards the end
write your 2K to the beginning
To visualize that:
|------------------------|
|-----------------XXXXXXX|
------>
|-------------------XXXXXXX|
|----------XXXXXXX---------|
------>
|------------XXXXXXX-------|
...repeat...
Note that this is a very unsafe operation which edits the file in place. If the process crashes, you're left with a file in an inconsistent state. If you don't have enough room on disk to duplicate a file you arguably shouldn't work with that file and expand your storage capacity first.
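A rough, hypothetical PHP sketch of that block-shifting approach (untested; as noted above, it edits the file in place and is unsafe if interrupted):

function prependInPlace($path, $prefix, $blockSize = 1048576)
{
    $prefixLen = strlen($prefix);
    $size = filesize($path);

    $fp = fopen($path, 'r+b');
    if ($fp === false) {
        return false;
    }

    // Walk backwards over the file in blocks, re-writing each block
    // $prefixLen bytes further towards the end.
    for ($pos = $size; $pos > 0; $pos -= $blockSize) {
        $chunkStart = max(0, $pos - $blockSize);
        $chunkLen   = $pos - $chunkStart;

        fseek($fp, $chunkStart, SEEK_SET);
        $chunk = fread($fp, $chunkLen);

        fseek($fp, $chunkStart + $prefixLen, SEEK_SET);
        fwrite($fp, $chunk);
    }

    // Finally write the new data at the very beginning.
    fseek($fp, 0, SEEK_SET);
    fwrite($fp, $prefix);

    fclose($fp);
    return true;
}

The block size only has to fit within the memory limit (1 MB here, well under the 30 MB limit), and the file only grows by the length of the prepended data, so the 50 MB of free disk space is enough.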
@deceze hinted at a great idea. So I've finished with:
function reverseFile($sIn, $sOut, $bRemoveSource = false)
{
    // Copy $sIn into $sOut byte by byte, starting from the end, truncating
    // the source as we go so the extra disk space needed stays minimal.
    $rFile = @fopen($sIn, 'a+');
    $rTemp = @fopen($sOut, 'a+');
    if (!$rFile || !$rTemp) {
        return false;
    }
    $iPos = filesize($sIn) - 1;
    while ($iPos >= 0) {
        fseek($rFile, $iPos, SEEK_SET);
        fwrite($rTemp, fread($rFile, 1));
        ftruncate($rFile, $iPos > 0 ? $iPos : 0);
        clearstatcache();
        $iPos--;
    }
    fclose($rFile);
    fclose($rTemp);
    if ($bRemoveSource) {
        unlink($sIn);
    }
    return true;
}
function writeReverse($sFile, $sData, $sTemp = null)
{
    if (!isset($sTemp)) {
        $sTemp = $sFile.'.rev';
    }
    // Reverse the file into a temp file, append the new data reversed,
    // then reverse it back so the new data ends up at the start.
    if (reverseFile($sFile, $sTemp, 1)) {
        file_put_contents($sTemp, strrev($sData), FILE_APPEND);
        return reverseFile($sTemp, $sFile, 1);
    }
    return false;
}
It will be quite slow, but it is recoverable if the process is interrupted (simply look at the .rev file).
Thanks to all who participated in this.
I've tried the code suggested by @AlmaDo. Don't try it on real projects, or you will burn in hell; it is VERY slow (a 60 MB file took 19 minutes to process).
You can run a shell script - https://stackoverflow.com/a/9533736/2064576 (processed in 420 ms; I cannot tell how much memory it uses)
Or try this PHP script - https://stackoverflow.com/a/16813550/2064576 (160 ms, worked with memory_limit=3M, did not work with 2M)
What's the cleanest way in php to open a file, read the contents, and subsequently overwrite the file's contents with some output based on the original contents? Specifically, I'm trying to open a file populated with a list of items (separated by newlines), process/add items to the list, remove the oldest N entries from the list, and finally write the list back into the file.
fopen(<path>, 'a+')
flock(<handle>, LOCK_EX)
fread(<handle>, filesize(<path>))
// process contents and remove old entries
fwrite(<handle>, <contents>)
flock(<handle>, LOCK_UN)
fclose(<handle>)
Note that I need to lock the file with flock() in order to protect it across multiple page requests. Will the 'w+' flag when fopen()ing do the trick? The php manual states that it will truncate the file to zero length, so it seems that may prevent me from reading the file's current contents.
If the file isn't overly large (that is, you can be confident loading it won't blow PHP's memory limit), then the easiest way to go is to just read the entire file into a string (file_get_contents()), process the string, and write the result back to the file (file_put_contents()). This approach has two problems:
If the file is too large (say, tens or hundreds of megabytes), or the processing is memory-hungry, you're going to run out of memory (even more so when you have multiple instances of the thing running).
The operation is destructive; when the saving fails halfway through, you lose all your original data.
If any of these is a concern, plan B is to process the file and at the same time write to a temporary file; after successful completion, close both files, rename (or delete) the original file and then rename the temporary file to the original filename.
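A hedged sketch of that plan B, processing line by line into a temporary file and swapping it in afterwards (the names here are placeholders, not part of any particular API):

$tmp = $filename . '.tmp';

$in  = fopen($filename, 'rb');
$out = fopen($tmp, 'wb');

while (($line = fgets($in)) !== false) {
    // process $line here...
    fwrite($out, $line);
}

fclose($in);
fclose($out);

// Replace the original only after the new file has been written completely.
// On POSIX filesystems, rename() over an existing file is atomic.
rename($tmp, $filename);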
Read
$data = file_get_contents($filename);
Write
file_put_contents($filename, $data);
One solution is to use a separate lock file to control access.
This solution assumes that only your script, or scripts you have access to, will want to write to the file. This is because the scripts will need to know to check a separate file for access.
$file_lock = obtain_file_lock();
if ($file_lock) {
    $old_information = file_get_contents('/path/to/main/file');
    $new_information = update_information_somehow($old_information);
    file_put_contents('/path/to/main/file', $new_information);
    release_file_lock($file_lock);
}
function obtain_file_lock() {
    $attempts = 10;
    // There are probably better ways of dealing with waiting for a file
    // lock but this shows the principle of dealing with the original
    // question.
    for ($ii = 0; $ii < $attempts; $ii++) {
        $lock_file = fopen('/path/to/lock/file', 'r'); // only need read access
        if (flock($lock_file, LOCK_EX)) {
            return $lock_file;
        } else {
            // give time for the other process to release the lock
            usleep(100000); // 0.1 seconds
        }
    }
    // This is only reached if all attempts fail.
    // Error code here for dealing with that eventuality.
}
function release_file_lock($lock_file) {
    flock($lock_file, LOCK_UN);
    fclose($lock_file);
}
This should prevent a concurrently-running script reading old information and updating that, causing you to lose information that another script has updated after you read the file. It will allow only one instance of the script to read the file and then overwrite it with updated information.
While this hopefully answers the original question, it doesn't give a good solution to making sure all concurrent scripts have the ability to record their information eventually.
I'm experimenting with the Twitter streaming API.
I use Phirehose to connect to Twitter and fetch the data, but I'm having problems storing it in files for further processing.
Basically, what I want to do is create a file named
date("YmdH") . ".txt"
for every hour of connection.
Here is how my code looks right now (it does not handle the hourly change of files):
public function enqueueStatus($status)
{
    $data = json_decode($status, true);
    if (isset($data['text']) /* more conditions here */) {
        $fp = fopen("/tmp/$time.txt", "w");
        fwrite($fp, $status);
        fclose($fp);
    }
}
Help is as always much appreciated :)
You want the 'append' mode in fopen - this will either append to a file or create it.
if (isset($data['text']) /* more conditions here */) {
    $fp = fopen("/tmp/" . date("YmdH") . ".txt", "a");
    fwrite($fp, $status);
    fclose($fp);
}
From the Phirehose Google Code wiki:
As of Phirehose version 0.2.2 there is an example of a simple "ghetto queue" included in the tarball (see files: ghetto-queue-collect.php and ghetto-queue-consume.php) that shows how statuses could be easily collected on to the filesystem for processing and then picked up by a separate process (consume).
This is a complete working sample of doing what you want to do. The rotation time interval is configurable too. Additionally there's another script to consume and process the written files too.
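Not the ghetto-queue code itself, but a minimal sketch of the hour-based rotation idea inside the enqueueStatus() override from the question ($this->fp and $this->currentHour are assumed class properties):

public function enqueueStatus($status)
{
    $data = json_decode($status, true);
    if (!isset($data['text'])) { // plus whatever other conditions apply
        return;
    }

    $hour = date("YmdH");
    if ($hour !== $this->currentHour) {
        // Hour changed (or first call): close the old file, open a new one.
        if ($this->fp) {
            fclose($this->fp);
        }
        $this->fp = fopen("/tmp/" . $hour . ".txt", "a"); // append mode also creates the file
        $this->currentHour = $hour;
    }

    fwrite($this->fp, $status . "\n");
}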
Now if only I could find a way to stop the whole script; my log keeps filling up (the script continues execution) even if I close the browser tab :P