I am reading a file containing around 50k lines using the file() function in Php. However, its giving a out of memory error since the contents of the file are stored in the memory as an array. Is there any other way?
Also, the lengths of the lines stored are variable.
Here's the code. Also the file is 700kB not mB.
private static function readScoreFile($scoreFile)
{
$file = file($scoreFile);
$relations = array();
for($i = 1; $i < count($file); $i++)
{
$relation = explode("\t",trim($file[$i]));
$relation = array(
'pwId_1' => $relation[0],
'pwId_2' => $relation[1],
'score' => $relation[2],
);
if($relation['score'] > 0)
{
$relations[] = $relation;
}
}
unset($file);
return $relations;
}
Use fopen, fread and fclose to read a file sequentially:
$handle = fopen($filename, 'r');
if ($handle) {
while (!feof($handle)) {
echo fread($handle, 8192);
}
fclose($handle);
}
EDIT after update of question and comments to answer of fabjoa:
There is definitely something fishy if a 700kb file eats up 140MB of memory with that code you gave (you could unset $relation at the end of the each iteration though). Consider using a debugger to step through it to see what happens. You might also want to consider rewriting the code to use SplFileObject's CSV functions as well (or their procedural cousins)
SplFileObject::setCsvControl example
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl('|');
foreach ($file as $row) {
list ($fruit, $quantity) = $row;
// Do something with values
}
For an OOP approach to iterate over the file, try SplFileObject:
SplFileObject::fgets example
$file = new SplFileObject("file.txt");
while (!$file->eof()) {
echo $file->fgets();
}
SplFileObject::next example
// Read through file line by line
$file = new SplFileObject("misc.txt");
while (!$file->eof()) {
echo $file->current();
$file->next();
}
or even
foreach(new SplFileObject("misc.txt") as $line) {
echo $line;
}
Pretty much related (if not duplicate):
How to save memory when reading a file in Php?
If you don't know the maximum line length and you are not comfortable to use a magic number for the max line length then you'll need to do an initial scan of the file and determine the max line length.
Other than that the following code should help you out:
// length is a large number or calculated from an initial file scan
while (!feof($handle)) {
$buffer = fgets($handle, $length);
echo $buffer;
}
Old question but since I haven't seen anyone mentioning it, PHP generators is a great way to reduce save memory consumption.
For example:
function read($fileName)
{
$fileHandler = fopen($fileName, 'rb');
while(($line = fgets($fileHandler)) !== false) {
yield rtrim($line, "\r\n");
}
fclose($fileHandler);
}
foreach(read(__DIR__ . '/filenameHere') as $line) {
echo $line;
}
allocate more memory during the operation, maybe something like ini_set('memory_limit', '16M');. Don't forget to go back to initial memory allocation once operation is done
Related
I have a script which parses the CSV file and start verifying the emails. this works fine for 1000 lines. but on 15 million lines it shows memory exhausted error. the file size is 400MB. any suggestions? how to parse and verify them?
Server Specs: Core i7 with 32GB of Ram
function parse_csv($file_name, $delimeter=',') {
$header = false;
$row_count = 0;
$data = [];
// clear any previous results
reset_parse_csv();
// parse
$file = fopen($file_name, 'r');
while (!feof($file)) {
$row = fgetcsv($file, 0, $delimeter);
if ($row == [NULL] || $row === FALSE) { continue; }
if (!$header) {
$header = $row;
} else {
$data[] = array_combine($header, $row);
$row_count++;
}
}
fclose($file);
return ['data' => $data, 'row_count' => $row_count];
}
function reset_parse_csv() {
$header = false;
$row_count = 0;
$data = [];
}
Iterating over a large dataset (file lines, etc.) and pushing into array it increases memory usage and this is directly proportional to the number of items handling.
So the bigger file, the bigger memory usage - in this case.
If it's desired a function to formatting the CSV data before processing it, backing it on the of generators sounds like a great idea.
Reading the PHP doc it fits very well for your case (emphasis mine):
A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which
may cause you to exceed a memory limit, or require a considerable
amount of processing time to generate.
Something like this:
function csv_read($filename, $delimeter=',')
{
$header = [];
$row = 0;
# tip: dont do that every time calling csv_read(), pass handle as param instead ;)
$handle = fopen($filename, "r");
if ($handle === false) {
return false;
}
while (($data = fgetcsv($handle, 0, $delimeter)) !== false) {
if (0 == $row) {
$header = $data;
} else {
# on demand usage
yield array_combine($header, $data);
}
$row++;
}
fclose($handle);
}
And then:
$generator = csv_read('rdu-weather-history.csv', ';');
foreach ($generator as $item) {
do_something($item);
}
The major difference here is:
you do not get (from memory) and consume all data at once. You get items on demand (like a stream) and process it instead, one item at time. It has huge impact on memory usage.
P.S.: The CSV file above has taken from: https://data.townofcary.org/api/v2/catalog/datasets/rdu-weather-history/exports/csv
It is not necessary to write a generator function. The SplFileObject also works fine.
$fileObj = new SplFileObject($file);
$fileObj->setFlags(SplFileObject::READ_CSV
| SplFileObject::SKIP_EMPTY
| SplFileObject::READ_AHEAD
| SplFileObject::DROP_NEW_LINE
);
$fileObj->setCsvControl(';');
foreach($fileObj as $row){
//do something
}
I tried that with the file "rdu-weather-history.csv" (> 500KB). memory_get_peak_usage() returned the value 424k after the foreach loop. The values must be processed line by line.
If a 2-dimensional array is created, the storage space required for the example increases to more as 8 Mbytes.
One thing you could possibly attempt, is a Bulk Import to MySQL which may give you a better platform to work from once it's imported.
LOAD DATA INFILE '/home/user/data.csv' INTO TABLE CSVImport; where CSVimport columns match your CSV.
Bit of a left field suggestion, but depending on what your use case is can be a better way to parse massive datasets.
I have a large CSV file. Because of memory concerns (with MySQL), I would like to only read a part of it at a time, if possible.
That it's CSV might not be important. The important thing is that it needs to cut with a new line.
Example content:
Some CSV content
that will break
on a line break
This could be my path:
$path = 'path/to/my.csv';
A solution for it could in my mind look like this:
$csv_content1 = read_csv_file($path, 0, 100);
$csv_content2 = read_csv_file($path, 101, 200);
It reads the raw content on line 0-100.
It reads the raw content on line 101-200.
Information
No parsing is needed (just split into content).
The file exists on my own server.
Don't read the whole file into the memory.
I want to be able to do the second read on another time, not on the same run. I accept save temp values like pointers if needed.
I've been trying to read other topics but did not find an exact match to this problem.
Maybe some of these could somehow work?
SplFileObject
fgetcsv
Maybe I can't use $csv_content2 before I've used $csv_content1, because I need to save some kind of a pointer? In that case it's fine. I will read them in order anyway.
After much thinking and reading I finally think I found the solution to my problem. Correct me if this is a bad solution because of memory usage or from other perspectives.
First run
$buffer = part($path_to_file, 0, 100);
Next run
$buffer = part($path_to_file, $buffer['pointer'], 100);
Function
function part($path, $offset, $rows) {
$buffer = array();
$buffer['content'] = '';
$buffer['pointer'] = array();
$handle = fopen($path, "r");
fseek($handle, $offset);
if( $handle ) {
for( $i = 0; $i < $rows; $i++ ) {
$buffer['content'] .= fgets($handle);
$buffer['pointer'] = mb_strlen($buffer['content']);
}
}
fclose($handle);
return($buffer);
}
In my more object oriented environment it looks more like this:
function part() {
$handle = fopen($this->path, "r");
fseek($handle, $this->pointer);
if( $handle ) {
for( $i = 0; $i < 2; $i++ ) {
if( $this->pointer != $this->filesize ) {
$this->content .= fgets($handle);
}
}
$this->pointer += mb_strlen($this->content);
}
fclose($handle);
}
I am reading a file containing around 50k lines using the file() function in Php. However, its giving a out of memory error since the contents of the file are stored in the memory as an array. Is there any other way?
Also, the lengths of the lines stored are variable.
Here's the code. Also the file is 700kB not mB.
private static function readScoreFile($scoreFile)
{
$file = file($scoreFile);
$relations = array();
for($i = 1; $i < count($file); $i++)
{
$relation = explode("\t",trim($file[$i]));
$relation = array(
'pwId_1' => $relation[0],
'pwId_2' => $relation[1],
'score' => $relation[2],
);
if($relation['score'] > 0)
{
$relations[] = $relation;
}
}
unset($file);
return $relations;
}
Use fopen, fread and fclose to read a file sequentially:
$handle = fopen($filename, 'r');
if ($handle) {
while (!feof($handle)) {
echo fread($handle, 8192);
}
fclose($handle);
}
EDIT after update of question and comments to answer of fabjoa:
There is definitely something fishy if a 700kb file eats up 140MB of memory with that code you gave (you could unset $relation at the end of the each iteration though). Consider using a debugger to step through it to see what happens. You might also want to consider rewriting the code to use SplFileObject's CSV functions as well (or their procedural cousins)
SplFileObject::setCsvControl example
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl('|');
foreach ($file as $row) {
list ($fruit, $quantity) = $row;
// Do something with values
}
For an OOP approach to iterate over the file, try SplFileObject:
SplFileObject::fgets example
$file = new SplFileObject("file.txt");
while (!$file->eof()) {
echo $file->fgets();
}
SplFileObject::next example
// Read through file line by line
$file = new SplFileObject("misc.txt");
while (!$file->eof()) {
echo $file->current();
$file->next();
}
or even
foreach(new SplFileObject("misc.txt") as $line) {
echo $line;
}
Pretty much related (if not duplicate):
How to save memory when reading a file in Php?
If you don't know the maximum line length and you are not comfortable to use a magic number for the max line length then you'll need to do an initial scan of the file and determine the max line length.
Other than that the following code should help you out:
// length is a large number or calculated from an initial file scan
while (!feof($handle)) {
$buffer = fgets($handle, $length);
echo $buffer;
}
Old question but since I haven't seen anyone mentioning it, PHP generators is a great way to reduce save memory consumption.
For example:
function read($fileName)
{
$fileHandler = fopen($fileName, 'rb');
while(($line = fgets($fileHandler)) !== false) {
yield rtrim($line, "\r\n");
}
fclose($fileHandler);
}
foreach(read(__DIR__ . '/filenameHere') as $line) {
echo $line;
}
allocate more memory during the operation, maybe something like ini_set('memory_limit', '16M');. Don't forget to go back to initial memory allocation once operation is done
using the function below I am pulling rows from tables, encoding them, then putting them in csv format. I am wondering if there is an easier way to prevent high memory usage. I don't want to have to rely on ini_set. I believe the memory consumption is caused from reading the temp file and gzipping it up. I'd love to be able to have a limit of 64mb ram to work with. Any ideas? Thanks!
function exportcsv($tables) {
foreach ($tables as $k => $v) {
$fh = fopen("php://temp", 'w');
$sql = mysql_query("SELECT * FROM $v");
while ($row = mysql_fetch_row($sql)) {
$line = array();
foreach ($row as $key => $vv) {
$line[] = base64_encode($vv);
}
fputcsv($fh, $line, chr(9));
}
rewind($fh);
$data = stream_get_contents($fh);
$gzdata = gzencode($data, 6);
$fp = fopen('sql/'.$v.'.csv.gz', 'w');
fwrite($fp, $gzdata);
fclose($fp);
fclose($fh);
}
}
untested, but hopefully you understand
function exportcsv($tables) {
foreach ($tables as $k => $v) {
$fh = fopen('compress.zlib://sql/' .$v. '.csv.gz', 'w');
$sql = mysql_unbuffered_query("SELECT * FROM $v");
while ($row = mysql_fetch_row($sql)) {
fputcsv($fh, array_map('base64_encode', $row), chr(9));
}
fclose($fh);
mysql_free_result($sql);
}
}
edit-
points of interest are the use of mysql_unbuffered_query and use of php's compression stream. regular mysql_query() buffers entire result set into memory. and using the compression stream gets rid of having to buffer the data yet again into php memory as a string before writing to a file.
Pulling the whole file into memory via the stream_get_contents() is probably what's killing you. Not only are you having to hold the base64 data (which is usually about 33% than its raw content), you've got the csv overhead to deal with as well. If memory is a problem, consider simply calling a command-line gzip app instead of gzipping inside of PHP, something like:
... database loop here ...
exec('gzip yourfile.csv');
And you can probably optimize things a little bit better inside the DB loop, and encode in-place, rather than building a new array for each row:
while($row = mysql_fetch_row($result)) {
foreach ($row as $key => $val) {
$row[$key] = base64_encode($val);
fputcsv($fh, $row, chr(9));
}
}
Not that this will reduce memory usage much - it's only a single row of data, so unless you're dealing with huge record fields, it won't have much effect.
You could insert some flushing there, currently your entire php file will be held in memory then flushed at the end, however if you manually
fflush($fh);
Also instead of gzipping the entire file you could gzip line by line using
$gz = gzopen ( $fh, 'w9' );
gzwrite ( $gz, $content );
gzclose ( $gz );
This will write line by line packed data rather than creating an entire file and then gzipping it.
I found this suggestion for compressing in chunks on http://johnibanez.com/node/21
It looks like it wouldn't be hard to modify for your purposes.
function gzcompressfile($source, $level = false){
$dest = $source . '.gz';
$mode = 'wb' . $level;
$error = false;
if ($fp_out = gzopen($dest, $mode)) {
if ($fp_in = fopen($source, 'rb')) {
while(!feof($fp_in)) {
gzwrite($fp_out, fread($fp_in, 1024*512));
}
fclose($fp_in);
}
else
$error=true;
gzclose($fp_out);
}
else
$error=true;
if ($error)
return false;
else
return $dest;
}
I have a script which, each time is called, gets the first line of a file. Each line is known to be exactly of the same length (32 alphanumeric chars) and terminates with "\r\n".
After getting the first line, the script removes it.
This is done in this way:
$contents = file_get_contents($file));
$first_line = substr($contents, 0, 32);
file_put_contents($file, substr($contents, 32 + 2)); //+2 because we remove also the \r\n
Obviously it works, but I was wondering whether there is a smarter (or more efficient) way to do this?
In my simple solution I basically read and rewrite the entire file just to take and remove the first line.
I came up with this idea yesterday:
function read_and_delete_first_line($filename) {
$file = file($filename);
$output = $file[0];
unset($file[0]);
file_put_contents($filename, $file);
return $output;
}
There is no more efficient way to do this other than rewriting the file.
No need to create a second temporary file, nor put the whole file in memory:
if ($handle = fopen("file", "c+")) { // open the file in reading and editing mode
if (flock($handle, LOCK_EX)) { // lock the file, so no one can read or edit this file
while (($line = fgets($handle, 4096)) !== FALSE) {
if (!isset($write_position)) { // move the line to previous position, except the first line
$write_position = 0;
} else {
$read_position = ftell($handle); // get actual line
fseek($handle, $write_position); // move to previous position
fputs($handle, $line); // put actual line in previous position
fseek($handle, $read_position); // return to actual position
$write_position += strlen($line); // set write position to the next loop
}
}
fflush($handle); // write any pending change to file
ftruncate($handle, $write_position); // drop the repeated last line
flock($handle, LOCK_UN); // unlock the file
}
fclose($handle);
}
This will shift the first line of a file, you dont need to load the entire file in memory like you do using the 'file' function. Maybe for small files is a bit more slow than with 'file' (maybe but i bet is not) but is able to manage largest files without problems.
$firstline = false;
if($handle = fopen($logFile,'c+')){
if(!flock($handle,LOCK_EX)){fclose($handle);}
$offset = 0;
$len = filesize($logFile);
while(($line = fgets($handle,4096)) !== false){
if(!$firstline){$firstline = $line;$offset = strlen($firstline);continue;}
$pos = ftell($handle);
fseek($handle,$pos-strlen($line)-$offset);
fputs($handle,$line);
fseek($handle,$pos);
}
fflush($handle);
ftruncate($handle,($len-$offset));
flock($handle,LOCK_UN);
fclose($handle);
}
you can iterate the file , instead of putting them all in memory
$handle = fopen("file", "r");
$first = fgets($handle,2048); #get first line.
$outfile="temp";
$o = fopen($outfile,"w");
while (!feof($handle)) {
$buffer = fgets($handle,2048);
fwrite($o,$buffer);
}
fclose($handle);
fclose($o);
rename($outfile,$file);
I wouldn't usually recommend opening up a shell for this sort of thing, but if you're doing this infrequently on really large files, there's probably something to be said for:
$lines = `wc -l myfile` - 1;
`tail -n $lines myfile > newfile`;
It's simple, and it doesn't involve reading the whole file into memory.
I wouldn't recommend this for small files, or extremely frequent use though. The overhead's too high.
You could store positional info into the file itself. For example, the first 8 bytes of the file could store an integer. This integer is the byte offset of the first real line in the file.
So, you never delete lines anymore. Instead, deleting a line means altering the start position. fseek() to it and then read lines as normal.
The file will grow big eventually. You could periodically clean up the orphaned lines to reduce the file size.
But seriously, just use a database and don't do stuff like this.
Here's one way:
$contents = file($file, FILE_IGNORE_NEW_LINES);
$first_line = array_shift($contents);
file_put_contents($file, implode("\r\n", $contents));
There's countless other ways to do that also, but all the methods would involve separating the first line somehow and saving the rest. You cannot avoid rewriting the whole file. An alternative take:
list($first_line, $contents) = explode("\r\n", file_get_contents($file), 2);
file_put_contents($file, implode("\r\n", $contents));
My problem was large files. I just needed to edit, or remove the first line. This was a solution I used. Didn't require to load the complete file in a variable. Currently echos, but you could always save the contents.
$fh = fopen($local_file, 'rb');
echo "add\tfirst\tline\n"; // add your new first line.
fgets($fh); // moves the file pointer to the next line.
echo stream_get_contents($fh); // flushes the remaining file.
fclose($fh);
I think this is best for any file size
$myfile = fopen("yourfile.txt", "r") or die("Unable to open file!");
$ch=1;
while(!feof($myfile)) {
$dataline= fgets($myfile) . "<br>";
if($ch == 2){
echo str_replace(' ', ' ', $dataline)."\n";
}
$ch = 2;
}
fclose($myfile);
The solutions here didn't work performantly for me.
My solution grabs the last line (not the first line, in my case it was not relevant to get the first or last line) from the file and removes that from that file.
This is very quickly even with very large files (>150000000 lines).
function file_pop($file)
{
if ($fp = #fopen($file, "c+")) {
if (!flock($fp, LOCK_EX)) {
fclose($fp);
}
$pos = -1;
$found = 0;
while ($found < 2) {
if (fseek($fp, $pos--, SEEK_END) < 0) { // can not seek to position
rewind($fp); // rewind to the beginnung of the file
break;
};
if (ord(fgetc($fp)) == 10) { // newline
$found++;
}
}
$lastpos = ftell($fp); // get current position of file
$lastline = fgets($fp); // get current line
ftruncate($fp, $lastpos); // truncate file to last position
flock($fp, LOCK_UN); // unlock
fclose($fp); // close the file
return trim($lastline);
}
}
You could use file() method.
Gets the first line
$content = file('myfile.txt');
echo $content[0];