concurrent file read/write - php

What happens when many requests come in to read and write a file in PHP? Do the requests get queued, or is only one accepted and the rest discarded?
I'm planning to use a text-based hit counter.

You can encounter the problem of a race condition.
To avoid this, if you only need to append simple data, you can use
file_put_contents($file, $data, FILE_APPEND | LOCK_EX);
and not worry about your data integrity.
If you need more complex operations you can use flock() (suited to the classic reader/writer problem).
For your PHP counter script I suggest something like this:
//> Register this impression
file_put_contents( $file, "\n", FILE_APPEND|LOCK_EX );
//> Read the total number of impression
echo count(file($file));
This way you don't have to implement a blocking mechanism yourself, and you keep both the system and your script lighter.
Addendum
To avoid having to count the array returned by file(), you can keep the system even lighter with this:
//> Register this impression
file_put_contents( $file, '1', FILE_APPEND|LOCK_EX );
//> Read the total number of impression
echo filesize($file);
Basically, to read your counter you just need to read its filesize, since each impression adds exactly 1 byte to the file.

No, requests will not be queued: readers will get damaged data, writers will overwrite each other, and the data will end up corrupted.
You can try to use flock() and the x mode of fopen().
It's not so easy to code a good locking mutex yourself, so try to find an existing implementation, or move the data from a file to a database.

You can use flock() to get a lock on the file prior to reading or writing it. If other threads are holding a lock on the file, flock() will wait until the other locks are released.
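As a minimal sketch of that (the file name and payload here are just placeholders, not from the question):

// Writer: take an exclusive lock so only one writer appends at a time
$fp = fopen('counter.txt', 'a');
if ($fp && flock($fp, LOCK_EX)) {
    fwrite($fp, "new entry\n");
    fflush($fp);           // push the write out before releasing the lock
    flock($fp, LOCK_UN);
}
if ($fp) {
    fclose($fp);
}

// Reader: a shared lock lets readers run in parallel but wait for writers
$fp = fopen('counter.txt', 'r');
if ($fp && flock($fp, LOCK_SH)) {
    $contents = stream_get_contents($fp);
    flock($fp, LOCK_UN);
}
if ($fp) {
    fclose($fp);
}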

Laravel download file from php output buffer VS. private storage folder | security

A user can download query results in CSV format. The file is small (a few KB), but the contents are important.
The first approach is to use the PHP output stream php://output:
$callback = function() use ($result, $columns) {
    $file = fopen('php://output', 'w');
    fputcsv($file, $columns);
    foreach ($result as $res) {
        fputcsv($file, array($res->from_user, $res->to_user, $res->message, $res->date_added));
    }
    fclose($file);
};
return response()->stream($callback, 200, $headers);
The second approach is to create a new disk in Laravel's storage system, set it to private, and download the file from there. You could even delete the file after the download:
'csv' => [
    'driver' => 'local',
    'root' => storage_path('csv'),
    'visibility' => 'private',
],
Here is the create/download code:
$file = fopen('../storage/csv/file.csv', 'w');
fputcsv($file, $columns);
foreach ($result as $res) {
    fputcsv($file, array($res->from_user, $res->to_user, $res->message, $res->date_added));
}
fclose($file);
return response()->make(Storage::disk('csv')->get('file.csv'), 200, $headers);
This return will instantly delete the file after the download:
return response()->download(Storage::disk('csv')->path('file.csv'))
    ->deleteFileAfterSend(true);
What would be more secure? What is the better approach? I am currently leaning towards the second approach with the storage.
Option 1
Reasons:
you are not keeping the file, so persisting to disk has limited use
the data size is small, so download failures are unlikely, and if they happen, the processing time to recreate the output is minimal (I assume it's a quick SQL query behind the scenes?)
keeping the file in storage creates opportunities for the file to replicate; an incremental backup or rsync that you may set up in the future could replicate the sensitive files before they get deleted...
deleting the file from the filesystem does not necessarily make the data unrecoverable
If you were dealing with files that are tens/hundreds of MB, I'd be thinking differently...
Let's think about all the options.
Option 1 is a good solution because you are not storing the file, so it is more secure than the others. But timeouts can be a problem under high traffic.
Option 2 is also a good solution if you delete the file afterwards, but you need to create files with unique names so that parallel downloads work.
Option 3 is like option 2, but if you are using Laravel, don't use it (and think about what happens when 2 people are downloading at the same time).
After this explanation: if you are running a single server, you need to work on option 1 to make it more secure; if you are using microservices, you need to work on option 2.
I can suggest one more thing to make it secure: create a unique hashed URL. For example, hash a timestamp with Laravel and check it before serving the download, so people can't download again from their download history.
https://example.com/download?hash={crypt(timestamp+1min)}
If it is not downloaded within 1 minute, the URL expires.
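A sketch of that idea using Laravel's built-in signed URLs, where the route name, controller and parameter are made up for the example:

// Generate a link that expires after one minute
$url = \Illuminate\Support\Facades\URL::temporarySignedRoute(
    'csv.download',            // hypothetical route name
    now()->addMinutes(1),      // expiry time
    ['file' => 'file.csv']     // hypothetical parameter
);

// Route definition: the 'signed' middleware rejects expired or tampered links
Route::get('/download/{file}', [CsvController::class, 'download'])  // hypothetical controller
    ->name('csv.download')
    ->middleware('signed');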
I think the answer depends on the current architecture and the size of the file to download.
The 1st approach is applicable when:
the files are small (less than 10 MB) -- thanks @tanerkay
you have a simple architecture (e.g. 1 server)
Reasons:
no download failures -- no need to retry
keep it simple
no files = no backups, no rsync, and no additional places to steal it from
The 2nd approach is applicable when:
your files are big (10+ MB)
you already have a microservices architecture with multiple load balancers -- keep the similarity
you have millions of users trying to download -- you just can't serve them without load balancing and parallel downloads
Reasons:
The second approach is definitely more scalable, and so more stable under high load, and in that sense more secure. Microservices are more time-consuming to build but scale better under heavy load.
Using separate file storage allows you, in the future, to have a separate file server, load balancer, queue manager, and dedicated access control.
If the content is important, it usually means that getting it is very important for the user.
But direct output with headers can hang, hit a timeout error, and so on.
Keeping the file until it has been downloaded is a much more reliable way of delivering it, I think.
Still, I would consider an expiration time instead of, or in addition to, tracking the download itself -- the download process can fail, or the file can be lost (ensure 1+ hour availability), or vice versa the user may only try to download it after a year, or never -- why should you keep the file for more than N days?
I think the first option,
The first approach is to use the PHP output stream php://output:
is more secure than the others, since you're not storing the file anywhere.

How to design the backend to put requests into a queue?

I want to design a system where users can submit files. After they submit a file, I will run some scripts on the files. I want to run these in order, so I want to maintain a queue of requests. How can I do this with PHP? Is there any open source library for this?
Thanks!
I would use a database.
When a user submits a file, add a row with the file's location in the file system to the database.
Have a cron job that checks for new submissions and processes them in order; when done, it marks them as processed in the database.
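A minimal sketch of that cron-driven worker, assuming a hypothetical uploads table with id, path and processed columns (process() stands in for whatever script you run on the file):

// run from cron, e.g. every minute
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// fetch unprocessed submissions, oldest first
$rows = $pdo->query('SELECT id, path FROM uploads WHERE processed = 0 ORDER BY id ASC')
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    process($row['path']);   // your own processing script

    // mark the submission as done so the next run skips it
    $stmt = $pdo->prepare('UPDATE uploads SET processed = 1 WHERE id = ?');
    $stmt->execute([$row['id']]);
}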
I would use Redis. Redis is a super-fast key-value store; usually its response time is in the double-digit microseconds (10-99 microseconds).
Redis transactions are atomic (transactions either happen or they don't), and you can have a job constantly running without using cron.
To use Redis with PHP, you can use Predis.
Once Redis is installed and Predis is set up to work with your script, upon uploading the file I would do something like this:
// 'hostname' and 'port' is the hostname and the port
// where Redis is installed and listening to.
$client = new Predis\Client('tcp://hostname:port');
// assuming the path to the file is stored in $pathToFile
$client->lpush('queue:files', $pathToFile);
then the script that needs to work with the files, just needs to do something like:
$client = new Predis\Client('tcp://hostname:port');

while (true) {
    // pop the oldest queued path (lpush + rpop gives FIFO order)
    $pathToFile = $client->rpop('queue:files');

    if (!$pathToFile) {
        // list is empty; sleep briefly so we don't busy-wait, then check again
        sleep(1);
        continue;
    }

    // there was something in the list, do whatever you need to do with it.
    // if there's an exception or an error, you can always use break; or exit; to terminate the loop.
}
Take into consideration that PHP tends to use a lot of memory, so I would explicitly collect garbage (via gc_enable() and gc_collect_cycles()) and unset() variables as you go.
Alternatively, you can use software like supervisord to make this script run just once and, as soon as it ends, start it up again.
In general, I would stay away from using a database and cron to implement queues. It can lead to serious problems.
Assume, for example, you use a table as a queue.
In the first run, your script pulls a job from the database and starts doing its thing.
Then for some reason, your script takes longer to run, and the cron job kicks in again, and now you have 2 scripts working the same file. This could either have no consequences, or could have serious consequences. That depends on what your application is actually doing.
So unless you're working with a very small dataset, and you know for sure that each run will finish before the next cron job kicks in so there will be no collisions, you should be fine. Otherwise, stay away from that approach.
A third-party library? This is too simple to need an entire library. You could use Redis (see the answer by AlanChavez) if you want to waste time and resources and then have to be concerned about garbage collection, when the real solution is not to bring the garbage into the mix in the first place.
Your queue is a text file. When a file is uploaded, the name of the file is appended to the queue.
$q= fopen('queue.txt','a');
The 'a' mode is important. It automatically moves the write pointer to the end of the file for append writes. But the reason it is important is because if the file does not exist, a new one is created.
fwrite($q,"$filename\n");
fclose($q);
If there are simultaneous append writes to this file, the OS will arbitrate the conflict without error. There is no need for file locking, cooperative multitasking, or transactional processing.
When your script that processes the queue begins to run it renames the live queue to a working queue.
if (!file_exists('q.txt')) {
    if (!file_exists('queue.txt')) { exit; }
    rename('queue.txt', 'q.txt');
    $q = fopen('q.txt', 'r');
    while (($filename = fgets($q, 4096)) !== false) {
        process($filename);
    }
    fclose($q);
    unlink('q.txt');
} else {
    echo 'Houston, we have a problem';
}
Now you see why the 'a' mode was important. We rename the queue and when the next upload occurs, the queue.txt is automatically created.
If the file is being written to as it is being renamed, the OS will sort it out without error. The rename is so fast that the chance of a simultaneous write is astronomically small, and sorting out file system contention is a basic OS feature. No need for file locking, cooperative multitasking, or transactional processing.
This is a bullet proof method I have been using for many years.
Replace the Apollo 13 quote with an error recovery routine. If q.txt exists the previous processing did not complete successfully.
That was too easy.
Because it is so simple and we have plenty of memory because we are so efficient: Let's have some fun.
Let's see if writing to the queue is faster than AlanChavez's "super fast" Redis with its double-digit millisecond response.
The time in seconds to add a file name to the queue = 0.000014537, or 14.5 µs. Slightly better than Redis's "super fast" 10-99 ms response time by, at minimum, 100,000%.
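For reference, a timing like that can be reproduced with a simple microtime() measurement around the append (a sketch, not the original benchmark):

$start = microtime(true);

// the same append used for the queue above
$q = fopen('queue.txt', 'a');
fwrite($q, "somefile.dat\n");
fclose($q);

printf("append took %.9f seconds\n", microtime(true) - $start);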

Two users write to a file at the same time? (PHP/file_put_contents)

If I write data to a file via file_put_contents with the FILE_APPEND flag set and two users submit data at the same time, will it append regardless, or is there a chance one entry will be overwritten?
If I set the LOCK_EX flag, will the second submission wait for the first submission to complete, or is the data lost when an exclusive lock can't be obtained?
How does PHP generally handle that? I'm running version 5.2.9, if that matters.
Thanks,
Ryan
You could also check the flock() function to implement proper locking (not based on the while/sleep trick).
If you set an exclusive file lock via LOCK_EX, the second script (time-wise) that attempts to write will simply return false from file_put_contents.
i.e.: it won't sit and wait until the file becomes available for writing.
As such, if required, you'll need to program this behaviour yourself, perhaps by attempting file_put_contents a limited number of times (e.g. 3) with a suitably sized sleep between each attempt.
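A minimal sketch of such a retry loop (the attempt count and sleep length are arbitrary choices):

$written = false;
for ($attempt = 0; $attempt < 3 && !$written; $attempt++) {
    // file_put_contents returns false if the write (or the lock) fails
    $written = file_put_contents($file, $data, FILE_APPEND | LOCK_EX) !== false;
    if (!$written) {
        usleep(100000); // wait 100 ms before trying again
    }
}
if (!$written) {
    // all attempts failed; log it or report an error to the user
}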

Integer number truncated when writing to text file in PHP?

I wrote a download counter:
$hit_count = @file_get_contents('download.txt');
$hit_count++;
@file_put_contents('download.txt', $hit_count);
header('Location: file/xxx.zip');
As simple as that. The problem is that the stats number gets truncated to 4 digits and so doesn't show the real count:
http://www.converthub.com/batch-image-converter/download.txt
The batch image converter program gets downloaded a couple hundred times per day and the PHP counter has been in place for months. The first time I found out about this was about 2 months ago, when I was very happy that it hit the 8000 mark after a few weeks, yet a week after that it was 500 again. And it has happened again and again.
No idea why. Why?
You're probably suffering a race condition in the filesystem: you're attempting to open and read a file, then open the same file and write to it. The operating system may not have fully released its original lock on the file when you close it for reading and then immediately open it for writing again. If the site is as busy as you say, you could even have multiple instances of your script trying to access the file at the same time.
Failing that, do all your file operations in one go. If you use fopen(), flock(), fread(), rewind(), fwrite() and fclose() to handle the hit counter update, you can avoid having to close the file and open it again. If you use r+ mode, you'll be able to read the value, increment it, and write the result back in one go.
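A sketch of that single-handle approach ('c+' is used here because it behaves like 'r+' but also creates the file if it doesn't exist yet):

$fp = fopen('download.txt', 'c+');
if ($fp && flock($fp, LOCK_EX)) {      // hold an exclusive lock for the whole read-modify-write
    $hit_count = (int) stream_get_contents($fp);
    $hit_count++;

    rewind($fp);                       // back to the start of the file
    ftruncate($fp, 0);                 // discard the old value
    fwrite($fp, (string) $hit_count);

    fflush($fp);                       // flush before releasing the lock
    flock($fp, LOCK_UN);
}
if ($fp) {
    fclose($fp);
}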
None of this can completely guarantee that you won't hit issues with concurrent accesses though.
I'd strongly recommend looking into a different approach to implementing your hit counter, such as a database driven counter.
Always do proper error handling; don't just suppress errors with @. In this case, it is probable that file_get_contents failed because the file was being written to at the time. Thus $hit_count is set to FALSE, and $hit_count++ makes it 1. So your counter gets randomly reset to 1 whenever the read fails.
If you insist on writing the number to a file, do proper error checking and only write to the file if you are SURE the read succeeded.
$hit_count = file_get_contents('download.txt');
if ($hit_count !== false) {
    $hit_count++;
    file_put_contents('download.txt', $hit_count);
}
header('Location: file/xxx.zip');
It will still fail occasionally, but at least it will not truncate your counter.
This is the kind of situation where having a database record the visits (which would also allow for greater data mining, since hits could be trended by date, time, referrer, location, etc.) would be a better solution than using a counter in a flat file.
A likely cause is a collision between a read and a write on the file (happening once every 8,000 hits or so). Using the LOCK_EX flag when writing with file_put_contents() may help, but I could not be 100% certain.
Better to look at recording the data in a database, as that is almost certain to prevent your current problem of losing the count.
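A sketch of such a database-driven counter (connection details and table name are made up), where the increment is a single atomic statement so concurrent hits can't lose counts:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// each hit is one atomic UPDATE; concurrent requests simply queue inside the database
$pdo->exec("UPDATE download_counters SET hits = hits + 1 WHERE name = 'batch-image-converter'");

// reading the current total
$hits = $pdo->query("SELECT hits FROM download_counters WHERE name = 'batch-image-converter'")
            ->fetchColumn();

header('Location: file/xxx.zip');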

PHP and concurrent file access

I'm building a small web app in PHP that stores some information in a plain text file. However, this text file is used and modified by all users of my app at some given point in time, possibly at the same time.
So the question is: what would be the best way to make sure that only one user can make changes to the file at any given point in time?
You should put a lock on the file:
$fp = fopen("/tmp/lock.txt", "r+");
if (flock($fp, LOCK_EX)) { // acquire an exclusive lock
ftruncate($fp, 0); // truncate file
fwrite($fp, "Write something here\n");
fflush($fp); // flush output before releasing the lock
flock($fp, LOCK_UN); // release the lock
} else {
echo "Couldn't get the lock!";
}
fclose($fp);
Take a look at http://www.php.net/flock
My suggestion is to use SQLite. It's fast, lightweight, stored in a file, and has mechanisms for preventing concurrent modification. Unless you're dealing with a preexisting file format, SQLite is the way to go.
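A minimal sketch of that with PDO's SQLite driver (the table and column names are made up for the example):

$db = new PDO('sqlite:' . __DIR__ . '/app.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// create the table on first run
$db->exec('CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)');

// SQLite locks the database file internally, so concurrent writers wait for
// each other instead of corrupting the data
$stmt = $db->prepare('INSERT INTO entries (body) VALUES (?)');
$stmt->execute(['some user-submitted content']);

// reading everything back
foreach ($db->query('SELECT id, body FROM entries') as $row) {
    echo $row['id'] . ': ' . $row['body'] . "\n";
}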
You could use a commit-log sort of format, sort of like how Wikipedia does it.
Use a database, and have every saved change create a new row (with an incremented revision number) that makes the previous record redundant; then you only have to worry about getting table locks during the save phase.
That way, if 2 people happen to edit something concurrently, both changes will appear in the history, and whichever one lost the commit war can be copied into a new revision.
Now if you don't want to use a database, you have to worry about having a revision-control file backing every visible file.
You could put revision control (Git/Mercurial/SVN) on the file system and then automate commits during the save phase.
Pseudocode:
user->save:
    getWriteLock();
    write( $file );
    write_commitmessage( $commitmessagefile ); # <-- author, comment, etc.
    call "hg commit -l $commitmessagefile $file";
    releaseWriteLock();
done.
At least this way when 2 people make critical commits at the same time, neither will get lost.
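A rough PHP sketch of that pseudocode, assuming Mercurial is installed on the server; the lock-file and commit-message paths are placeholders:

function saveWithHistory($file, $contents, $author, $comment)
{
    // a separate lock file acts as the write lock
    $lock = fopen($file . '.lock', 'c');
    if (!$lock || !flock($lock, LOCK_EX)) {
        throw new RuntimeException('Could not acquire write lock');
    }

    file_put_contents($file, $contents);

    // author, comment, etc. go into the commit message file
    $msgFile = $file . '.commitmsg';
    file_put_contents($msgFile, $comment . "\n\nAuthor: " . $author);

    // record the change in the repository history
    exec('hg commit -l ' . escapeshellarg($msgFile) . ' ' . escapeshellarg($file));

    flock($lock, LOCK_UN);   // release the write lock
    fclose($lock);
}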
A single file shared by many users really shouldn't be the strategy you use, I think. Otherwise you'll probably need to implement a single (global) access point that monitors whether the file is currently being edited or not: acquire a lock, do your modification, release the lock, and so on. I'd go with 'Nobody's suggestion to use a database (SQLite if you don't want the overhead of a fully decked-out RDBMS).
