Cut off beginning of a large file in PHP - php

I would like to cut off the beginning of a large file in PHP. Use of file_get_contents() is not possible due to memory restrictions.
What is the best way to delete the first $n characters from a file?
If it is possible to do it without creating a second file, I would prefer that solution.
Update After the file has been modified, it will be used by other scripts.

If you don't have enough memory to buffer the entire file, you'll need to create two files (at least temporarily) regardless of your solution.
Look into fseek(), which allows you to go to a particular byte position within a file.
// Open the file
$filename = 'somefile.dat';
$file = fopen($filename, 'r');
// Skip the first 1 KB
fseek($file, 1024);
// Your processing goes here...
// Close the file
fclose($file);
In your case, you could open the original file for reading and the temp file for writing concurrently. Seek the original file. Loop over the original file, reading a small chunk and writing it to temp. Then rename temp to have the same name as original.

Related

Which is the faster way to remove a list of rows from huge log file using PHP

I need to remove various useless log rows from a huge log file (200 MB)
/usr/local/cpanel/logs/error_log
The useless log rows are in array $useless
The way I am doing is
$working_log="/usr/local/cpanel/logs/error_log";
foreach($useless as $row)
{
if ($row!="") {
file_put_contents($working_log,
str_replace("$row","", file_get_contents($working_log)));
}
}
I need to remove about 65000 rows from the log file;
the code above does the job but it works slow, about 0.041 sec to remove each row.
Do you know a faster way to do this job using php ?
If the file can be loaded in memory twice (it seems it can if your code works) then you can remove all the strings from $useless in a single str_replace() call.
The documentation of str_replace() function explains how:
If search is an array and replace is a string, then this replacement string is used for every value of search.
$working_log="/usr/local/cpanel/logs/error_log";
file_put_contents(
$working_log,
str_replace($useless, '', file_get_contents($working_log))
);
When the file becomes too large to be processed by the code above you have to take a different approach: create a temporary file, read each line from the input file and write it to the temporary file or ignore it. At the end, move the temporary file over the source file:
$working_log="/usr/local/cpanel/logs/error_log";
$tempfile = "/usr/local/cpanel/logs/error_log.new";
$fin = fopen($working_log, "r");
$fout = fopen($tempfile, "w");
while (! feof($fin)) {
$line = fgets($fin);
if (! in_array($line, $useless)) {
fputs($fout, $line);
}
}
fclose($fin);
fclose($fout);
// Move the current log out of the way (keep it as backup)
rename($working_log, $working_log.".bak");
// Put the new file instead.
rename($tempfile, $working_log);
You have to add error handling (fopen(), fputs() may fail for various reasons) and code or human intervention to remove the backup file.

Why is PHP mixing two fopen handles on one file?

I noticed while testing two fopen() handles on one file, that the handles or channels mix, and the file contents empty when i call fread(). One handle is read, and one handle is write.
Example code:
$rh = fopen('existingfilewithcontent.txt', 'r');
$wh = fopen('existingfilewithcontent.txt', 'w');
echo fread($rh, 1000);
fclose($rh);
fclose($wh);
// file is now blank
This is tested on Linux & Windows.
I could not find anything in the PHP docs about it.
Please do not ask my why I would want two handles on one file as that is not the question.
Thankyou
Opening a file for write is destructive.
Manual says:
'w' - Open for writing only; place the file pointer at the beginning of the file and truncate the file to zero length. If the file does not exist, attempt to create it.
You probably want:
'r+' - Open for reading and writing; place the file pointer at the beginning of the file.
OR
'a+' - Open for reading and writing; place the file pointer at the end of the file. If the file does not exist, attempt to create it. In this mode, fseek() only affects the reading position, writes are always appended.
See manual: http://php.net/manual/en/function.fopen.php

About PHP parallel file read/write

Have a file in a website. A PHP script modifies it like this:
$contents = file_get_contents("MyFile");
// ** Modify $contents **
// Now rewrite:
$file = fopen("MyFile","w+");
fwrite($file, $contents);
fclose($file);
The modification is pretty simple. It grabs the file's contents and adds a few lines. Then it overwrites the file.
I am aware that PHP has a function for appending contents to a file rather than overwriting it all over again. However, I want to keep using this method since I'll probably change the modification algorithm in the future (so appending may not be enough).
Anyway, I was testing this out, making like 100 requests. Each time I call the script, I add a new line to the file:
First call:
First!
Second call:
First!
Second!
Third call:
First!
Second!
Third!
Pretty cool. But then:
Fourth call:
Fourth!
Fifth call:
Fourth!
Fifth!
As you can see, the first, second and third lines simply disappeared.
I've determined that the problem isn't the contents string modification algorithm (I've tested it separately). Something is messed up either when reading or writing the file.
I think it is very likely that the issue is when the file's contents are read: if $contents, for some odd reason, is empty, then the behavior shown above makes sense.
I'm no expert with PHP, but perhaps the fact that I performed 100 calls almost simultaneously caused this issue. What if there are two processes, and one is writing the file while the other is reading it?
What is the recommended approach for this issue? How should I manage file modifications when several processes could be writing/reading the same file?
What you need to do is use flock() (file lock)
What I think is happening is your script is grabbing the file while the previous script is still writing to it. Since the file is still being written to, it doesn't exist at the moment when PHP grabs it, so php gets an empty string, and once the later processes is done it overwrites the previous file.
The solution is to have the script usleep() for a few milliseconds when the file is locked and then try again. Just be sure to put a limit on how many times your script can try.
NOTICE:
If another PHP script or application accesses the file, it may not necessarily use/check for file locks. This is because file locks are often seen as an optional extra, since in most cases they aren't needed.
So the issue is parallel accesses to the same file, while one is writing to the file another instance is reading before the file has been updated.
PHP luckily has a mechanisms for locking the file so no one can read from it until the lock is released and the file has been updated.
flock()
can be used and the documentation is here
You need to create a lock, so that any concurrent requests will have to wait their turn. This can be done using the flock() function. You will have to use fopen(), as opposed to file_get_contents(), but it should not be a problem:
$file = 'file.txt';
$fh = fopen($file, 'r+');
if (flock($fh, LOCK_EX)) { // Get an exclusive lock
$data = fread($fh, filesize($file)); // Get the contents of file
// Do something with data here...
ftruncate($fh, 0); // Empty the file
fwrite($fh, $newData); // Write new data to file
fclose($fh); // Close handle and release lock
} else {
die('Unable to get a lock on file: '.$file);
}

Reading a file line by line for large files efficiently in PHP?

My app reads a large file 5MB - 10MB that has been entered in with json entries line by line.
Each line is handled by a parser that is fed to multiple parsers and treated separately. Once the file is read, the file is moved. The Program is continuously fed with files to be processed.
The program currently works with #file_get_contents($filename). The program's structure works as is.
The problem is that file_get_contents loads the entire file as an array and the entire system runs for a minute. I suspect that I can gain speed if I read it line by line rather than wait for the file to load into memory (I might be wrong and open to suggestion).
There are too many file handler that does this. What is the most effective way to achieve what I need and which file reading method is best for this?
I have fopen, fread, readfile, file, and fscanf to contend with off the top of my head. However when I read the man for them - its all code to read generic files without a clear indication what is best for larger files.
$file = fopen("file.json", "r");
if ($file)
{
while (($line = fgets($file)) !== false)
{
echo $line;
}
}
else
{
echo "Unable to open the file";
}
Fgets read until it reach EOL, or EOF. if you want, you can add how much it should read using the second arg.
For more info about fgets: http://us3.php.net/fgets

What's the best way to read from and then overwrite file contents in php?

What's the cleanest way in php to open a file, read the contents, and subsequently overwrite the file's contents with some output based on the original contents? Specifically, I'm trying to open a file populated with a list of items (separated by newlines), process/add items to the list, remove the oldest N entries from the list, and finally write the list back into the file.
fopen(<path>, 'a+')
flock(<handle>, LOCK_EX)
fread(<handle>, filesize(<path>))
// process contents and remove old entries
fwrite(<handle>, <contents>)
flock(<handle>, LOCK_UN)
fclose(<handle>)
Note that I need to lock the file with flock() in order to protect it across multiple page requests. Will the 'w+' flag when fopen()ing do the trick? The php manual states that it will truncate the file to zero length, so it seems that may prevent me from reading the file's current contents.
If the file isn't overly large (that is, you can be confident loading it won't blow PHP's memory limit), then the easiest way to go is to just read the entire file into a string (file_get_contents()), process the string, and write the result back to the file (file_put_contents()). This approach has two problems:
If the file is too large (say, tens or hundreds of megabytes), or the processing is memory-hungry, you're going to run out of memory (even more so when you have multiple instances of the thing running).
The operation is destructive; when the saving fails halfway through, you lose all your original data.
If any of these is a concern, plan B is to process the file and at the same time write to a temporary file; after successful completion, close both files, rename (or delete) the original file and then rename the temporary file to the original filename.
Read
$data = file_get_contents($filename);
Write
file_put_contents($filename, $data);
One solution is to use a separate lock file to control access.
This solution assumes that only your script, or scripts you have access to, will want to write to the file. This is because the scripts will need to know to check a separate file for access.
$file_lock = obtain_file_lock();
if ($file_lock) {
$old_information = file_get_contents('/path/to/main/file');
$new_information = update_information_somehow($old_information);
file_put_contents('/path/to/main/file', $new_information);
release_file_lock($file_lock);
}
function obtain_file_lock() {
$attempts = 10;
// There are probably better ways of dealing with waiting for a file
// lock but this shows the principle of dealing with the original
// question.
for ($ii = 0; $ii < $attempts; $ii++) {
$lock_file = fopen('/path/to/lock/file', 'r'); //only need read access
if (flock($lock_file, LOCK_EX)) {
return $lock_file;
} else {
//give time for other process to release lock
usleep(100000); //0.1 seconds
}
}
//This is only reached if all attempts fail.
//Error code here for dealing with that eventuality.
}
function release_file_lock($lock_file) {
flock($lock_file, LOCK_UN);
fclose($lock_file);
}
This should prevent a concurrently-running script reading old information and updating that, causing you to lose information that another script has updated after you read the file. It will allow only one instance of the script to read the file and then overwrite it with updated information.
While this hopefully answers the original question, it doesn't give a good solution to making sure all concurrent scripts have the ability to record their information eventually.

Categories