Reading large text files efficiently - PHP

I have a couple of huge (11 MB and 54 MB) files that I need to read in order to process the rest of the script. Currently I'm reading the files and storing them in an array like so:
$pricelist = array();
$fp = fopen($DIR.'datafeeds/pricelist.csv', 'r');
while (($line = fgetcsv($fp, 0, ",")) !== FALSE) {
    if ($line) {
        $pricelist[$line[2]] = $line;
    }
}
fclose($fp);
... but I'm constantly getting memory overload messages from my web host. How do I read the files more efficiently?
I don't need to store everything: I already have a keyword that exactly matches the array key $line[2], and I only need to read that one line/array.

If you know the key, why don't you filter by it? You can also check memory usage with the memory_get_usage() function to see how much memory is allocated after you fill your $pricelist array.
echo memory_get_usage() . "\n";
$yourKey = 'some_key';
$pricelist = array();
$fp = fopen($DIR.'datafeeds/pricelist.csv', 'r');
while (($line = fgetcsv($fp, 0, ",")) !== FALSE) {
    if (isset($line[2]) && $line[2] == $yourKey) {
        $pricelist[$line[2]] = $line;
        break;
        /* If there is a possibility of multiple matching lines,
           store each line in a separate array element instead:
           $pricelist[$line[2]][] = $line;
        */
    }
}
fclose($fp);
echo memory_get_usage() . "\n";

You can try this (I have not checked if it works properly)
$data = explode("\n", shell_exec('cat filename.csv | grep KEYWORD'));
You will get all the lines containing the keyword, each line as an element of array.
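If you go this route, a hedged variant might look like the sketch below. It assumes a Unix host where grep is available and reuses $DIR and $yourKey from the answer above; since grep matches the keyword anywhere in the line, the column is re-checked after parsing the hit as CSV.
// sketch only: escape the keyword for the shell and verify the column
$output = shell_exec('grep ' . escapeshellarg($yourKey) . ' ' . escapeshellarg($DIR . 'datafeeds/pricelist.csv'));
$pricelist = array();
foreach (array_filter(explode("\n", (string) $output)) as $match) {
    $fields = str_getcsv($match);
    if (isset($fields[2]) && $fields[2] === $yourKey) {
        $pricelist[$fields[2]] = $fields;
        break;
    }
}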
Let me know if it helps.

I agree with what user2864740 said: "The problem is the in-memory usage caused by the array itself and is not about "reading" the file".
My solution is:
Split your `$priceList` array
Load only one split array into memory at a time
Keep the other split arrays in an intermediate file
N.B.: I did not test what I've written
<?php
define("MAX_LINE", 10000);
define("CSV_SEPARATOR", ',');

function intermediateBuilder($csvFile, $intermediateCsvFile) {
    $pricelist = array();
    $currentLine = 0;
    $totalSerializedArray = 0;

    if (!is_file($csvFile)) {
        throw new Exception("this is not a regular file: " . $csvFile);
    }
    $fp = fopen($csvFile, 'r');
    if (!$fp) {
        throw new Exception("can not read this file: " . $csvFile);
    }
    while (($line = fgetcsv($fp, 0, CSV_SEPARATOR)) !== FALSE) {
        if ($line) {
            $pricelist[$line[2]] = $line;
        }
        if (++$currentLine == MAX_LINE) {
            $fp2 = fopen($intermediateCsvFile, 'a');
            if (!$fp2) throw new Exception("can not write in this intermediate csv file: " . $intermediateCsvFile);
            fputs($fp2, serialize($pricelist) . "\n");
            fclose($fp2);
            unset($pricelist);
            $pricelist = array();
            $currentLine = 0;
            $totalSerializedArray++;
        }
    }
    if (!empty($pricelist)) {
        // flush the last, partially filled chunk as well
        $fp2 = fopen($intermediateCsvFile, 'a');
        if (!$fp2) throw new Exception("can not write in this intermediate csv file: " . $intermediateCsvFile);
        fputs($fp2, serialize($pricelist) . "\n");
        fclose($fp2);
        $totalSerializedArray++;
    }
    fclose($fp);
    return $totalSerializedArray;
}
/**
 * @param array   : by-reference array that receives the unserialized chunk
 * @param integer : the array number to read from the intermediate csv file; starts from index 1
 * @param string  : the (relative|absolute) path/name of the intermediate csv file
 * @throws Exception
 */
function loadArray(&$array, $arrayNumber, $intermediateCsvFile) {
    $currentLine = 0;
    $fp = fopen($intermediateCsvFile, 'r');
    if (!$fp) {
        throw new Exception("can not read this intermediate csv file: " . $intermediateCsvFile);
    }
    // each line of the intermediate file holds one serialized chunk, so read it
    // with fgets (fgetcsv would split the serialized string on commas)
    while (($line = fgets($fp)) !== FALSE) {
        if (++$currentLine == $arrayNumber) {
            fclose($fp);
            $array = unserialize(rtrim($line, "\n"));
            return;
        }
    }
    fclose($fp);
    throw new Exception("the array number argument [" . $arrayNumber . "] is invalid (out of bounds)");
}
Usage example
try {
    $totalSerializedArray = intermediateBuilder($DIR . 'datafeeds/pricelist.csv',
                                                $DIR . 'datafeeds/intermediatePricelist.csv');
    $priceList = array();
    // load one chunk at a time until the key is found
    for ($arrayNumber = 1; $arrayNumber <= $totalSerializedArray; $arrayNumber++) {
        loadArray($priceList,
                  $arrayNumber,
                  $DIR . 'datafeeds/intermediatePricelist.csv');
        if (array_key_exists($key, $priceList)) {
            break;
        }
    }
} catch (Exception $e) {
    // TODO : log the error ...
}

You can drop the
if ($line) {
check; it only repeats the test already in the loop condition. If your file is 54 MB and you are going to retain every line from the file as an array, plus the key from column 3 (which is hashed for lookup), I could see that requiring 75-85 MB to store it all in memory. That isn't much; most WordPress or Magento pages using widgets run 150-200 MB. But if your host sets the limit low, it could be a problem.
You can try filtering out some rows by changing the if ($line) to something like if ($line[1] == 'book') to reduce how much you store (see the sketch below). But the only sure way to handle storing that much content in memory is to have that much memory available to the script.
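A minimal sketch of that filtered loop, assuming the same $DIR path as in the question and that column 2 ($line[1]) holds a category such as 'book' (both the column and the value are only illustrative):
$pricelist = array();
$fp = fopen($DIR.'datafeeds/pricelist.csv', 'r');
while (($line = fgetcsv($fp, 0, ",")) !== FALSE) {
    // keep only the rows you actually need instead of the whole file
    if (isset($line[1]) && $line[1] == 'book') {
        $pricelist[$line[2]] = $line;
    }
}
fclose($fp);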

You can try setting a bigger memory limit with this; change the limit to whatever you need.
ini_set('memory_limit', '2048M');
But it also depends on how the script is going to be used.

Related

Trying to hash all strings in the file after the comma separator

I have a file with a keyword on each line. Each line starts with a number, then a comma, and after that the keyword (like a comma-separated CSV, but as a text file). The file looks like this
7,n00t
41,n01
13,n021
21,n02
18,n03
13,n04
15,n05
13,n06
18,n07
13,n08
14,n09
9,n0a
What I'm trying to do is go through the whole file and hash only the keywords, without the number before the comma.
What I've tried is this.
$savePath = "test-file.txt";
$handle = fopen($savePath, "r+");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        $hash1 = substr($line, strpos($line, ",") + 1);
        $hash2 = hash('ripemd160', $hash1);
        fwrite($handle, $hash2);
    }
    fclose($handle);
} else {
    echo "Can't open the file!";
}
It works, but the problem is that it also hashes the number before the comma, and on most of the lines I get one run-together string. This is the output
d743dcc66de14a3430d806aad64a67345fd0b23d0007
75f32ebf42e3ffd70fc3f63d3a61fc6af0075c24000088
7b816ac9cbe2da6a6643538564216e441f55fe9f6,00009
f0ba52b83ffac69fddd8786d6f48e7700562f0170b
def75b09e253faea412f67e67a535595b00366dce
c0da025038ed83c687ddc430da9846ecb97f3998l
c0da025038ed83c687ddc430da9846ecb97f39985,0000r
c12530b4b78bde7bc000e4f15a15bcea013eaf8c
9c1185a5c5e9fc54612808977ee8f548b2258d31,00010
efa60a26277fde0514aec5b51f560a4ba25be3c111
0e25f9d48d432ff5256e6da30ab644d1ca726cb700123
ad6d049d794146aca7a169fd6cb3086954bf2c63012
Should be
7,d743dcc66de14a3430d806aad64a67345fd0b23d0007
41,75f32ebf42e3ffd70fc3f63d3a61fc6af0075c24000088
13,7b816ac9cbe2da6a6643538564216e441f55fe9f6,00009
21,f0ba52b83ffac69fddd8786d6f48e7700562f0170b
18,def75b09e253faea412f67e67a535595b00366dce
13,c0da025038ed83c687ddc430da9846ecb97f3998l
15,c0da025038ed83c687ddc430da9846ecb97f39985,0000r
13,c12530b4b78bde7bc000e4f15a15bcea013eaf8c
18,9c1185a5c5e9fc54612808977ee8f548b2258d31,00010
13,efa60a26277fde0514aec5b51f560a4ba25be3c111
14,0e25f9d48d432ff5256e6da30ab644d1ca726cb700123
9,ad6d049d794146aca7a169fd6cb3086954bf2c63012
Any ideas what the problem is?
The thing is, you are reading from and writing to the file at the same time, so the internal pointer is being juggled all the time. Instead, read all the lines first, store the results in an array, fseek the file pointer back to the beginning of the file, and then write the new lines one by one, as shown below.
Snippet:
<?php
$handle = fopen("test-file.txt", "r+");
if (!$handle) {
    throw new Exception("Can't open file!");
}
$newLines = [];
while (($line = fgets($handle)) !== false) {
    $line = rtrim($line, "\r\n"); // don't let the trailing newline end up in the hash
    $hash = hash('ripemd160', substr($line, strpos($line, ",") + 1));
    $newLines[] = substr($line, 0, strpos($line, ",")) . "," . $hash;
}
fseek($handle, 0);
foreach ($newLines as $line) {
    fwrite($handle, $line . "\n");
}
fclose($handle);
The trouble is twofold:
You're trying to write to the file while you're reading it, without ever repositioning the file pointer, and vice-versa for the reads.
You're trying to overwrite a few bytes of data with a 40-byte hash, which ends up clobbering most of the rest of the input you're trying to read.
If you want to change a file like this you need to create a new file, write your data to that, and then rename the new file to the old one.
Also, use fgetcsv() and fputcsv() to read and write CSV files, otherwise you're going to wind up fighting with edge cases when your input data starts getting complex.
$savePath = "test-file.txt";
$in_h  = fopen($savePath, "r");
$out_h = fopen($savePath.'.new', 'w');
if ($in_h && $out_h) {
    while (($line = fgetcsv($in_h)) !== false) {
        $line[1] = hash('ripemd160', $line[1]);
        fputcsv($out_h, $line);
    }
    fclose($in_h);
    fclose($out_h);
    rename($savePath.'.new', $savePath);
} else {
    echo "Can't open the file!";
}

PHP memory exhausted while using array_combine in a foreach loop

I'm having trouble when I try to use array_combine in a foreach loop. It ends up with this error:
PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 85 bytes) in
Here is my code:
$data = array();
$csvData = $this->getData($file);
if ($columnNames) {
    $columns = array_shift($csvData);
    foreach ($csvData as $keyIndex => $rowData) {
        $data[$keyIndex] = array_combine($columns, array_values($rowData));
    }
}
return $data;
The source CSV file I'm using has approximately 1,000,000 rows. For this line
$csvData = $this->getData($file)
I was using a while loop to read the CSV and assign it to an array, and that works without any problem. The trouble comes from array_combine and the foreach loop.
Do you have any idea to resolve this or simply have a better solution?
UPDATED
Here is the code to read the CSV file (using while loop)
$data = array();
if (!file_exists($file)) {
    throw new Exception('File "' . $file . '" does not exist');
}
$fh = fopen($file, 'r');
while ($rowData = fgetcsv($fh, $this->_lineLength, $this->_delimiter, $this->_enclosure)) {
    $data[] = $rowData;
}
fclose($fh);
return $data;
UPDATED 2
The code above works without any problem if you are playing around with a CSV file of roughly 20,000-30,000 rows. From 50,000 rows and up, the memory gets exhausted.
You are in fact keeping (or trying to keep) two distinct copies of the whole dataset in memory. First you load the whole CSV data into memory using getData(), and then you copy that data into the $data array by looping over the data in memory and creating a new array.
You should use stream-based reading when loading the CSV data, to keep just one data set in memory. If you're on PHP 5.5+ (which you definitely should be, by the way), this is as simple as changing your getData method to look like this:
protected function getData($file) {
    if (!file_exists($file)) {
        throw new Exception('File "' . $file . '" does not exist');
    }
    $fh = fopen($file, 'r');
    while ($rowData = fgetcsv($fh, $this->_lineLength, $this->_delimiter, $this->_enclosure)) {
        yield $rowData;
    }
    fclose($fh);
}
This makes use of a so-called generator, which is a PHP >= 5.5 feature. The rest of your code should continue to work, as the inner workings of getData should be transparent to the calling code (that's only half the truth, since you can no longer array_shift() the result).
UPDATE to explain how extracting the column headers will work now.
$data = array();
$csvData = $this->getData($file);
if ($columnNames) { // don't know what this one does exactly
    $columns = null;
    foreach ($csvData as $keyIndex => $rowData) {
        if ($keyIndex === 0) {
            $columns = $rowData;
        } else {
            $data[$keyIndex /* -1 if you need 0-index */] = array_combine(
                $columns,
                array_values($rowData)
            );
        }
    }
}
return $data;

Fatal Error - Out of Memory while reading a *.dat file in PHP [duplicate]

I am reading a file containing around 50k lines using the file() function in PHP. However, it's giving an out-of-memory error, since the contents of the file are stored in memory as an array. Is there any other way?
Also, the lengths of the stored lines vary.
Here's the code. Also, the file is 700 kB, not MB.
private static function readScoreFile($scoreFile)
{
    $file = file($scoreFile);
    $relations = array();
    for ($i = 1; $i < count($file); $i++)
    {
        $relation = explode("\t", trim($file[$i]));
        $relation = array(
            'pwId_1' => $relation[0],
            'pwId_2' => $relation[1],
            'score'  => $relation[2],
        );
        if ($relation['score'] > 0)
        {
            $relations[] = $relation;
        }
    }
    unset($file);
    return $relations;
}
Use fopen, fread and fclose to read a file sequentially:
$handle = fopen($filename, 'r');
if ($handle) {
    while (!feof($handle)) {
        echo fread($handle, 8192);
    }
    fclose($handle);
}
EDIT after the update of the question and the comments on fabjoa's answer:
There is definitely something fishy if a 700 kB file eats up 140 MB of memory with the code you gave (you could unset $relation at the end of each iteration, though). Consider using a debugger to step through it and see what happens. You might also want to consider rewriting the code to use SplFileObject's CSV functions (or their procedural cousins).
SplFileObject::setCsvControl example
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl('|');
foreach ($file as $row) {
    list($fruit, $quantity) = $row;
    // Do something with values
}
For an OOP approach to iterate over the file, try SplFileObject:
SplFileObject::fgets example
$file = new SplFileObject("file.txt");
while (!$file->eof()) {
    echo $file->fgets();
}
SplFileObject::next example
// Read through file line by line
$file = new SplFileObject("misc.txt");
while (!$file->eof()) {
    echo $file->current();
    $file->next();
}
or even
foreach (new SplFileObject("misc.txt") as $line) {
    echo $line;
}
Pretty much related (if not duplicate):
How to save memory when reading a file in Php?
If you don't know the maximum line length, and you are not comfortable using a magic number for it, then you'll need to do an initial scan of the file to determine the maximum line length (see the sketch after the snippet below).
Other than that the following code should help you out:
// $length is a large number or calculated from an initial file scan
while (!feof($handle)) {
    $buffer = fgets($handle, $length);
    echo $buffer;
}
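A minimal sketch of that initial scan, assuming $filename points at the file and the same $handle is reused for the second pass (fgets() with a length argument reads at most length - 1 bytes, hence the + 1):
$handle = fopen($filename, 'r');
$length = 0;
// first pass: find the longest line (its length includes the trailing newline)
while (($line = fgets($handle)) !== false) {
    $length = max($length, strlen($line) + 1);
}
rewind($handle); // second pass can now call fgets($handle, $length) safely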
Old question, but since I haven't seen anyone mention it: PHP generators are a great way to reduce memory consumption.
For example:
function read($fileName)
{
    $fileHandler = fopen($fileName, 'rb');
    while (($line = fgets($fileHandler)) !== false) {
        yield rtrim($line, "\r\n");
    }
    fclose($fileHandler);
}

foreach (read(__DIR__ . '/filenameHere') as $line) {
    echo $line;
}
Allocate more memory during the operation, maybe with something like ini_set('memory_limit', '16M');. Don't forget to go back to the initial memory allocation once the operation is done.
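A small sketch of that raise-then-restore pattern (the '16M' value is only illustrative):
// remember the current limit, raise it for the heavy work, then restore it
$previousLimit = ini_get('memory_limit');
ini_set('memory_limit', '16M');

// ... memory-hungry file processing goes here ...

ini_set('memory_limit', $previousLimit);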

Split big file every time </byebye> occurs

The code below splits my file every 10 lines, but I want it to split every time
</byebye>
occurs. That way, I will get multiple files, each containing:
<byebye>
*stuff here*
</byebye>
Code:
<?php
/**
 *
 * Split large files into smaller ones
 * @param string $source Source file
 * @param string $targetpath Target directory for saving files
 * @param int $lines Number of lines to split
 * @return void
 */
function split_file($source, $targetpath = 'files/', $lines = 10) {
    $i = 0;
    $j = 1;
    $date = date("m-d-y");
    $buffer = '';
    $handle = @fopen($source, "r");
    while (!feof($handle)) {
        $buffer .= @fgets($handle, 4096);
        $i++;
        if ($i >= $lines) {
            $fname = $targetpath.".part_".$date.$j.".xml";
            if (!$fhandle = @fopen($fname, 'w')) {
                echo "Cannot open file ($fname)";
                exit;
            }
            if (!@fwrite($fhandle, $buffer)) {
                echo "Cannot write to file ($fname)";
                exit;
            }
            fclose($fhandle);
            $j++;
            $buffer = '';
            $i = 0;
            $lines += 10; // add 10 to $lines after each iteration. Modify this line as required
        }
    }
    fclose($handle);
}

split_file('testxml.xml');
?>
Any ideas?
If I understand you right, this should do it.
$content = file_get_contents($source);
$parts = explode('</byebye>', $content);
$parts = array_map('trim', $parts);
Then just write the parts to the different files
$dateString = date('m-d-y');
foreach ($parts as $index => $part) {
    file_put_contents("{$targetpath}part_{$dateString}{$index}.xml", $part);
}
But I assume (without knowing your source) that this will result in invalid XML. You should use one of the XML parsers (SimpleXML, DOM, XMLReader, ...) to handle XML files; a sketch of that route follows.
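A hedged sketch of the parser route, assuming the file is well-formed XML with a single root element wrapping the repeated <byebye> blocks; XMLReader streams the document, so memory use stays low even for big files:
$reader = new XMLReader();
if (!$reader->open('testxml.xml')) {
    exit('Cannot open XML file');
}
$i = 0;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'byebye') {
        // readOuterXml() returns the whole <byebye>...</byebye> block as a string
        file_put_contents('files/part_' . (++$i) . '.xml', $reader->readOuterXml());
    }
}
$reader->close();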
Sidenote: You use @ much, much too much.
If you are worried about file sizes, you can switch to a file resource and use fread or fgets to control how much memory you use at once.
$f = fopen($source, "r");
$buffer = '';
$part = 0;
while (!feof($f))
{
    $buffer .= fgets($f);
    $arr = explode('</byebye>', $buffer);
    if (count($arr) == 1) {
        // no closing tag seen yet, keep buffering
        continue;
    }
    // the last element is whatever follows the final </byebye>;
    // everything before it is one (or more) complete entries
    $buffer = array_pop($arr);
    foreach ($arr as $entry) {
        file_put_contents($targetpath . 'part_' . (++$part) . '.xml', $entry . '</byebye>');
    }
}
fclose($f);
You can also save more memory by opening a file for output and, as you parse, piping the contents straight to it. When you encounter a closing </byebye> you would close that file and open the next one.
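A minimal sketch of that streaming variant, assuming each </byebye> ends a line and reusing $source and $targetpath from the question above:
$in = fopen($source, 'r');
$part = 1;
$out = fopen($targetpath . 'part_' . $part . '.xml', 'w');
while (($line = fgets($in)) !== false) {
    fwrite($out, $line);
    if (strpos($line, '</byebye>') !== false) {
        // this entry is complete: close it and start the next part file
        fclose($out);
        $out = fopen($targetpath . 'part_' . (++$part) . '.xml', 'w');
    }
}
fclose($out);
fclose($in);
If the input ends right after a </byebye>, the last part file will simply be empty and can be deleted.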
