Segfault with PHP fread() on text file over 2GB - php

I have a PHP script which reads in a text file and counts all the lines in the file which match a specified regular expression. The script had worked well up until now, when it segfaulted on the fread of a file over 2GB.
Before the segfault, I initially received a fatal error: PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 2223941409 bytes).
To fix that I added this line to my script: ini_set('memory_limit', '4G');
That fixes the memory size exhausted error but I get the segfault on fread now.
Here's a condensed working version of the script which will exhibit the error:
#!/usr/bin/php
<?php
ini_set('memory_limit', '4G');
$file = $argv[1];
$fh = fopen($file, 'r');
$fsize = filesize($file);
print("SIZE: ".$fsize."\n" );
$myData = fread($fh, $fsize);
print("Got passed fread!\n");
fclose($fh);
preg_match_all( '/Z\t/', $myData, $sArray );
$scount = count($sArray,COUNT_RECURSIVE);
print("COUNT: ".$scount."\n");
?>
Sample output:
$ runtest.php testfile.txt
SIZE: 2223941408
Segmentation fault (core dumped)
Other info:
OS: CentOS release 6.7 (Final) x86_64
PHP 5.3.3 (cli) (built: Jul 9 2015 17:39:00) 64-bit

You're probably using a 32-bit PHP distribution. Under that architecture a PHP process cannot allocate more than 2 GB of RAM; in practice the upper limit is closer to 1 GB than 2 GB, and the interpreter crashes well before reaching it. Additionally, integer variables cannot be greater than PHP_INT_MAX, which in 32-bit builds is as small as 2,147,483,647 (2^31 - 1).
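If you want to double-check which build you're actually running, a quick check of my own (not part of the original answer) is to print PHP_INT_SIZE and PHP_INT_MAX:
<?php
// PHP_INT_SIZE is 4 on a 32-bit build and 8 on a 64-bit build
print "PHP_INT_SIZE: " . PHP_INT_SIZE . " bytes\n";
print "PHP_INT_MAX:  " . PHP_INT_MAX . "\n";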
This highlights two problems in your code:
$fsize = filesize($file);
... will not work if the file size is greater than PHP_INT_MAX.
Because PHP's integer type is signed and many platforms use 32bit integers, some filesystem functions may return unexpected results for files which are larger than 2GB.
$myData = fread($fh, $fsize);
... will crash for large files because you're loading the complete file contents into memory and then doing additional processing that will probably eat even more memory.
You'd better redesign your algorithm and read the file in small chunks (a task at which fread() excels). Counting the occurrences of a two-character substring should only need a few KB of RAM.
Here's a possible approach that assumes single byte encoding (as your code does):
// Ridiculously small value for illustration purposes; set it to something bigger for better performance
define('CHUNK_SIZE', 4);

$fsize = $scount = 0;
$fh = fopen($file, 'r');
$possible_pending_match = false;

while (!feof($fh)) {
    $chunk = fread($fh, CHUNK_SIZE);
    if ($chunk === false || $chunk === '') {
        break; // nothing left to read
    }
    $fsize += strlen($chunk);
    $scount += substr_count($chunk, "Z\t");
    // count a "Z\t" match that straddles the border between two chunks
    if ($possible_pending_match && $chunk[0] === "\t") {
        $scount++;
    }
    $possible_pending_match = substr($chunk, -1) === 'Z';
}
fclose($fh);

print("SIZE: ".$fsize."\n");
print("COUNT: ".$scount."\n");
print("MEMORY: ".memory_get_peak_usage(true)." bytes\n");
You'd need to add 1 to $scount to get the same result as your code: count($sArray, COUNT_RECURSIVE) also counts the nested match array itself as an item, so the original script reports one more than the actual number of matches.
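A minimal illustration of that off-by-one (my own example data, not from the question):
preg_match_all('/Z\t/', "Z\taaa Z\tbbb", $sArray);
var_dump(count($sArray, COUNT_RECURSIVE)); // int(3): the nested sub-array plus its 2 elements
var_dump(count($sArray[0]));               // int(2): the actual number of matches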

The 2 GB boundary suggests there is some internal 32-bit limitation in PHP. Are you running a 32-bit PHP build?
There is an alternative solution: you can do this with very little memory overhead by calling a shell command from PHP. Memory use stays at no more than a couple of MB, because grep and wc only load portions of the file into memory at a time.
$lines = shell_exec("grep 'Z\t' $file | wc --lines");
grep: command that searches files using regular expressions
wc: command that counts words/lines/characters
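One caveat: $file is interpolated straight into the shell command, so a path containing spaces or shell metacharacters would break it. A slightly safer variant of the same pipeline (still assuming GNU grep/wc):
// Quote the file name before handing it to the shell
$lines = (int) shell_exec("grep 'Z\t' " . escapeshellarg($file) . " | wc --lines");
print("COUNT: ".$lines."\n");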

Related

PHP won't read full file into array, only partial

I have a file with 3,200,000 lines of csv data (with 450 columns). Total file size is 6 GB.
I read the file like this:
$data = file('csv.out');
Without fail, it only reads 897,000 lines. I confirmed this with print_r() and echo sizeof($data). I increased my memory_limit to a ridiculous value like 80 GB, but it didn't make a difference.
Now, it DID read in my other large file, which has the same number of lines (3,200,000) but only a few columns, so the total file size is 1.1 GB. So it appears to be a total file size issue. FYI, 897,000 lines in the $data array is around 1.68 GB.
Update: I increased the second (longer) file to 2.1 GB (over 5 million lines) and it reads it in fine, yet it still truncates the other file at 1.68 GB. So it does not appear to be a size issue. If I keep increasing the size of the second file to 2.2 GB, instead of truncating it and continuing the program (like it does for the first file), it dies and core dumps.
Update: I verified my system is 64 bit by printing integer and float numbers:
<?php
$large_number = 2147483647;
var_dump($large_number); // int(2147483647)
$large_number = 2147483648;
var_dump($large_number); // float(2147483648)
$million = 1000000;
$large_number = 50000 * $million;
var_dump($large_number); // float(50000000000)
$large_number = 9223372036854775807;
var_dump($large_number); // int(9223372036854775807)
$large_number = 9223372036854775808;
var_dump($large_number); // float(9.2233720368548E+18)
$million = 1000000;
$large_number = 50000000000000 * $million;
var_dump($large_number); // float(5.0E+19)
print "PHP_INT_MAX: " . PHP_INT_MAX . "\n";
print "PHP_INT_SIZE: " . PHP_INT_SIZE . " bytes (" . (PHP_INT_SIZE * 8) . " bits)\n";
?>
The output from this script is:
int(2147483647)
int(2147483648)
int(50000000000)
int(9223372036854775807)
float(9.2233720368548E+18)
float(5.0E+19)
PHP_INT_MAX: 9223372036854775807
PHP_INT_SIZE: 8 bytes (64 bits)
So since it's 64-bit, and the memory limit is set really high, why is PHP not reading files > 2.15 GB?
Some things that come to mind:
If you're using a 32-bit PHP build, you cannot read files that are larger than 2 GB.
If reading the file takes too long, there could be time-outs.
If the file is really huge, then reading it all into memory is going to be problematic. It's usually better to read blocks of data and process them as you go, unless you need random access to all parts of the file (see the sketch below).
Another approach (I've used it in the past) is to chop the large file into smaller, more manageable ones (this should work if it's a straightforward log file, for example).
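As a sketch of that block-by-block approach (the per-row processing is an assumption, not the asker's actual code), you can stream the CSV one row at a time with fgetcsv() so memory use stays flat no matter how large the file is:
$handle = fopen('csv.out', 'r');
if ($handle === false) {
    die("Cannot open csv.out\n");
}
$rows = 0;
while (($row = fgetcsv($handle)) !== false) {
    // $row holds the ~450 columns of the current line;
    // process it here instead of keeping every row in memory
    $rows++;
}
fclose($handle);
print("Rows processed: $rows\n");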
I fixed it. All I had to do was change the way I read the files. Why...I do not know.
Old code that only reads 2.15 GB out of 6.0 GB:
$data = file('csv.out');
New code that reads the full 6.0 GB:
$data = array();
$i = 1;
$handle = fopen('csv.out', 'r');
if ($handle) {
    while (($data[$i] = fgets($handle)) !== false) {
        // process the line read
        $i++;
    }
    fclose($handle);
}
Feel free to shed some light on why. There must be some limitation when using
$var=file();
Interestingly, 2.15 GB is close to the 32-bit limit I read about (2^31 bytes = 2,147,483,648 bytes ≈ 2.15 GB).

Downloading a large file in PHP, max 8192 bytes?

I'm using the following code to download a large file (>100 MB). The code is executed in a shell.
$fileHandle = fopen($url, 'rb');
$bytes = 100000;
while ($read = @fread($fileHandle, $bytes)) {
    debug(strlen($read));
    if (!file_put_contents($filePath, $read, FILE_APPEND)) {
        return false;
    }
}
Where I would expect debug(strlen($read)) to output 100000, this is the actual output:
10627
8192
8192
8192
...
Why doesn't fread read more than 8192 bytes after the first time, and why does it read 10627 bytes on the first iteration?
This makes downloading the file very slow; is there a better way to do this?
The answer to your question is (quoting from the PHP docs for fread()):
if the stream is read buffered and it does not represent a plain file, at most one read of up to a number of bytes equal to the chunk size (usually 8192) is made; depending on the previously buffered data, the size of the returned data may be larger than the chunk size
The solution to your performance problem is to use stream_copy_to_stream(), which should be faster than block reading with fread() and more memory-efficient as well.
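A minimal sketch of that approach, reusing the $url and $filePath from the question:
$src = fopen($url, 'rb');
$dst = fopen($filePath, 'wb');
if ($src === false || $dst === false) {
    return false;
}
// Copies the stream in internally managed chunks; no large string
// is ever built inside the PHP script
$copied = stream_copy_to_stream($src, $dst);
fclose($src);
fclose($dst);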
I checked the manual, and found this: http://php.net/manual/en/function.fread.php
"If the stream is read buffered and it does not represent a plain file, at most one read of up to a number of bytes equal to the chunk size (usually 8192) is made;"
Since you're opening a URL, this is probably the case.
It doesn't explain the 10627 bytes on the first read, though...
Besides that, why would you expect 100,000-byte reads to be faster than 8192-byte reads?
I doubt that's your bottleneck. My guess is that either the download speed from the URL or the write speed of the disk is the problem.

PHP Excel Memory Limit of 2GB Exhausted reading a 286KB file

Additional info:
I'm running this from the command line. CentOS 6, 32 GB RAM total, 2 GB memory limit for PHP.
I tried increasing the memory limit to 4GB, but now I get a Fatal error: String size overflow. PHP maximum string size is 2GB.
My code is very simple test code:
$Reader = new SpreadsheetReader($_path_to_files . 'ABC.xls');
$i = 0;
foreach ($Reader as $Row)
{
    $i++;
    print_r($Row);
    if ($i > 10) break;
}
And that is only supposed to print 10 rows. That is taking 2 gigabytes of memory?
The error occurs at line 253 in excel_reader2.php, inside class OLERead, in the function read($sFilenName).
Here is the code causing my exhaustion:
if ($this->numExtensionBlocks != 0) {
    $bbdBlocks = (BIG_BLOCK_SIZE - BIG_BLOCK_DEPOT_BLOCKS_POS) / 4;
}
for ($i = 0; $i < $bbdBlocks; $i++) { // LINE 253
    $bigBlockDepotBlocks[$i] = GetInt4d($this->data, $pos);
    $pos += 4;
}
I solved the problem. It turned out to be somewhat unrelated to the PHP code.
The program I am writing downloads .xls, .xlsx, and .csv files from email and FTP. The .xls file that was causing the memory overflow had been downloaded in ASCII mode instead of binary.
I changed my default to binary mode and added a check that switches to ASCII mode for .csv files.
I still find it strange that this makes the program build a 2 GB string. If there are no line breaks in the binary file, I can see how the entire file might end up in one string, but the file is only 286 KB. So, that's strange.
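A sketch of that mode-selection logic (the connection details, variable names, and the extension check are my assumptions, not the poster's actual code):
// Spreadsheets (.xls/.xlsx) must be transferred in binary mode;
// plain .csv files can use ASCII mode
$ext  = strtolower(pathinfo($remoteFile, PATHINFO_EXTENSION));
$mode = ($ext === 'csv') ? FTP_ASCII : FTP_BINARY;

$ftp = ftp_connect('ftp.example.com');
ftp_login($ftp, $ftpUser, $ftpPass);
ftp_pasv($ftp, true);
if (!ftp_get($ftp, $localFile, $remoteFile, $mode)) {
    die("Download failed: $remoteFile\n");
}
ftp_close($ftp);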

What's the best (most efficient) way to search for content in a file and change it with PHP? [duplicate]

This question already has answers here:
PHP what is the best way to write data to middle of file without rewriting file
(3 answers)
Closed 9 years ago.
I have a file that I'm reading with PHP. I want to look for some lines that start with some white space and then some key words I'm looking for (for example, "project_name:") and then change other parts of that line.
Currently, the way I handle this is to read the entire file into a string variable, manipulate that string and then write the whole thing back to the file, fully replacing the entire file (via fopen( filepath, "wb" ) and fwrite()), but this feels inefficient. Is there a better way?
Update: After finishing my function I had time to benchmark it. I used a 1 GB file for testing, but the results were unsatisfying :|
Yes, the peak memory allocation is significantly smaller:
standard solution: 1.86 GB
custom solution: 653 KB (4096-byte buffer size)
But compared to the following solution there is only a slight performance boost:
ini_set('memory_limit', -1);
file_put_contents(
    'test.txt',
    str_replace('the', 'teh', file_get_contents('test.txt'))
);
The script above took ~16 seconds; the custom solution took ~13 seconds.
Summary: the custom solution is slightly faster on large files and consumes much less memory(!).
Also, if you want to run this in a web server environment, the custom solution is better, as many concurrent scripts would likely consume all of the system's available memory.
Original Answer:
The only thing that comes to mind is to read the file in chunks that fit the file system's block size and write the content (or modified content) back to a temporary file. After you finish processing, you use rename() to overwrite the original file.
This reduces peak memory usage and should be significantly faster if the file is really large.
Note: On a Linux system you can get the file system block size using:
sudo dumpe2fs /dev/yourdev | grep 'Block size'
I got 4096.
Here comes the function:
function freplace($search, $replace, $filename, $buffersize = 4096) {
    $fd1 = fopen($filename, 'r');
    if (!is_resource($fd1)) {
        die('error opening file');
    }
    // the tempfile can be anywhere, but should be on the same partition as the original
    $tmpfile = tempnam('.', uniqid());
    $fd2 = fopen($tmpfile, 'w+');
    // we store len(search) - 1 chars from the end of the buffer on each loop;
    // this is the maximum number of chars of the search string that can sit
    // on the border between two buffers
    $tmp = '';
    while (!feof($fd1)) {
        $buffer = fread($fd1, $buffersize);
        // prepend the rest from the last iteration
        $buffer = $tmp . $buffer;
        // replace
        $buffer = str_replace($search, $replace, $buffer);
        // store len(search) - 1 chars from the end of the buffer
        $tmp = substr($buffer, -1 * strlen($search) + 1);
        // write the processed buffer (minus the rest)
        fwrite($fd2, $buffer, strlen($buffer) - strlen($tmp));
    }
    if (!empty($tmp)) {
        fwrite($fd2, $tmp);
    }
    fclose($fd1);
    fclose($fd2);
    rename($tmpfile, $filename);
}
Call it like this:
freplace('foo', 'bar', 'test.txt');

PHP using fwrite and fread with input stream

I'm looking for the most efficient way to write the contents of the PHP input stream to disk, without using much of the memory that is granted to the PHP script. For example, if the max file size that can be uploaded is 1 GB but PHP only has 32 MB of memory.
define('MAX_FILE_LEN', 1073741824); // 1 GB in bytes
$hSource = fopen('php://input', 'r');
$hDest = fopen(UPLOADS_DIR.'/'.$MyTempName.'.tmp', 'w');
fwrite($hDest, fread($hSource, MAX_FILE_LEN));
fclose($hDest);
fclose($hSource);
Does nesting fread inside fwrite, as the code above shows, mean that the entire file will be loaded into memory?
For doing the opposite (writing a file to the output stream), PHP offers a function called fpassthru which I believe does not hold the contents of the file in the PHP script's memory.
I'm looking for something similar but in reverse (writing from input stream to file). Thank you for any assistance you can give.
Yep - fread used in that way would read up to 1 GB into a string first, and then write that back out via fwrite. PHP just isn't smart enough to create a memory-efficient pipe for you.
I would try something akin to the following:
$hSource = fopen('php://input', 'r');
$hDest = fopen(UPLOADS_DIR . '/' . $MyTempName . '.tmp', 'w');
while (!feof($hSource)) {
    /*
     * I'm going to read in 1K chunks. You could make this
     * larger, but as a rule of thumb I'd keep it to 1/4 of
     * your php memory_limit.
     */
    $chunk = fread($hSource, 1024);
    fwrite($hDest, $chunk);
}
fclose($hSource);
fclose($hDest);
If you wanted to be really picky, you could also unset($chunk); within the loop after fwrite to absolutely ensure that PHP frees up the memory - but that shouldn't be necessary, as the next loop will overwrite whatever memory is being used by $chunk at that time.
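If you'd rather not manage the loop yourself, stream_copy_to_stream() performs the same chunked copy internally; a minimal sketch using the constants from the question:
$hSource = fopen('php://input', 'r');
$hDest   = fopen(UPLOADS_DIR.'/'.$MyTempName.'.tmp', 'w');
// Copies in internal chunks, capped at MAX_FILE_LEN bytes, without
// holding the whole request body in PHP memory
stream_copy_to_stream($hSource, $hDest, MAX_FILE_LEN);
fclose($hSource);
fclose($hDest);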
