Opening and reading a 2GB csv - php

I have been having problems opening and reading the contents of a 2GB CSV file. Every time I run the script it exhausts the server's memory (10GB VPS cloud server) and then gets killed. I have made a test script and was wondering if anyone could have a look and confirm that I am not doing anything silly (PHP-wise) here that would cause what seems an unusually high amount of memory usage. I have spoken to my hosting company, but they seem to be of the opinion that it is a code problem. So I'm just wondering if anyone can look over this and confirm there is nothing in the code that would cause this kind of problem.
Also, if you deal with 2GB CSVs, have you encountered anything like this before?
Thanks
Tim
<?php
ini_set("memory_limit", "10240M");

$start = time();
echo date("Y-m-d H:i:s", $start)."\n";

$file = 'myfile.csv';
$lines = $keys = array();
$line_count = 0;

$csv = fopen($file, "r");
if (!empty($csv)) {
    echo "file open \n";
    while (($csv_line = fgetcsv($csv, null, ',', '"')) !== false) {
        if ($line_count == 0) {
            foreach ($csv_line as $item) {
                $keys[] = preg_replace("/[^a-zA-Z0-9]/", "", $item);
            }
        } else {
            $array = array();
            for ($i = 0; $i < count($csv_line); $i++) {
                $array[$keys[$i]] = $csv_line[$i];
            }
            $lines[] = (object) $array;
            //print_r($array);
            //echo "<br/><br/>";
        }
        $line_count++;
    }
    if ($line_count == 0) {
        echo "invalid csv or wrong delimiter / enclosure ".$file;
    }
} else {
    echo "cannot open ".$file;
}
fclose($csv);

echo $line_count . " rows \n";
$end = time();
echo date("Y-m-d H:i:s", $end)."\n";
$time = number_format((($end - $start)/60), 2);
echo $time."\n";
echo "peak memory usages ".memory_get_peak_usage(true)."\n";

It is not actually an "opening" problem but rather a processing problem.
I am sure you don't need to keep all the parsed lines in memory like you currently do.
Why not just put each parsed line wherever it belongs - a database, another file, or anything else?
That way your code keeps as little as one line in memory at a time.
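For illustration, a minimal sketch of that idea (the PDO connection, the orders table, and its columns are my assumptions, not from the question); each row is written out and then discarded instead of accumulated:

<?php
// Sketch only: credentials, table name, and columns are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO orders (col_a, col_b) VALUES (?, ?)');

$csv = fopen('myfile.csv', 'r');
if ($csv === false) {
    die('cannot open file');
}
fgetcsv($csv, 0, ',', '"'); // skip the header row
while (($row = fgetcsv($csv, 0, ',', '"')) !== false) {
    $stmt->execute([$row[0], $row[1]]); // the row can be freed after this
}
fclose($csv);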

As others have already pointed out, you're loading the whole 2 GB file into memory. On top of that you create an array of strings (plus an object) out of each line, so the resulting memory needed is actually more than the plain file size.
You might want to process each row of the CSV file separately, ideally with an iterator, for example one that returns each line as a keyed array:
$csv = new CSVFile('../data/test.csv');
foreach ($csv as $line) {
var_dump($line);
}
Example output:
array(3) {
["Make"]=> string(5) "Chevy"
["Model"]=> string(4) "1500"
["Note"]=> string(6) "loaded"
}
array(3) {
["Make"]=> string(5) "Chevy"
["Model"]=> string(4) "2500"
["Note"]=> string(0) ""
}
array(3) {
["Make"]=> string(5) "Chevy"
["Model"]=> string(0) ""
["Note"]=> string(6) "loaded"
}
This iterator is inspired by one that's built into PHP, called SplFileObject. As this is an iterator, you decide what to do with each line's/row's data. See the related question: Process CSV Into Array With Column Headings For Key
class CSVFile extends SplFileObject
{
    private $keys;

    public function __construct($file)
    {
        parent::__construct($file);
        $this->setFlags(SplFileObject::READ_CSV);
    }

    public function rewind()
    {
        parent::rewind();
        $this->keys = parent::current();
        parent::next();
    }

    public function current()
    {
        return array_combine($this->keys, parent::current());
    }

    public function getKeys()
    {
        return $this->keys;
    }
}
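One caveat worth adding (my note, not part of the original answer): array_combine() fails when a row has a different number of fields than the header, which large real-world CSVs often do. A defensive variant of current() might pad or truncate the row first, something like:

public function current()
{
    $row = (array) parent::current();
    // pad short rows and truncate long ones so array_combine() cannot fail
    $row = array_slice(array_pad($row, count($this->keys), null), 0, count($this->keys));
    return array_combine($this->keys, $row);
}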

PHP is arguably the wrong language for this. String manipulation usually results in copies of strings being allocated in memory, and memory is not always reclaimed as promptly as you might expect. If you know how to do it, and it fits the execution environment, you'd be better off with Perl or sed/awk.
Having said this, there are two memory hogs in the script. The first is the foreach, which copies the array. Do a foreach on the array_keys, and refer back to the string entry in the array to get at the lines. The second is the one raised by @YourCommonSense: you should design your algorithm so it works in streaming mode (i.e. not requiring the storage of the full dataset in memory). At a cursory glance, it seems feasible.
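For what it's worth, a sketch of the array_keys suggestion applied to the question's header loop (a sketch only; on modern PHP copy-on-write makes the saving small):

// Instead of: foreach ($csv_line as $item) { ... }
foreach (array_keys($csv_line) as $i) {
    // index back into the original array; no per-value copy is forced
    $keys[] = preg_replace("/[^a-zA-Z0-9]/", "", $csv_line[$i]);
}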

Related

How to parse a csv file that contains 15 million lines of data in php

I have a script which parses the CSV file and starts verifying the emails. This works fine for 1,000 lines, but on 15 million lines it shows a memory exhausted error. The file size is 400MB. Any suggestions? How do I parse and verify them?
Server specs: Core i7 with 32GB of RAM
function parse_csv($file_name, $delimeter=',') {
    $header = false;
    $row_count = 0;
    $data = [];

    // clear any previous results
    reset_parse_csv();

    // parse
    $file = fopen($file_name, 'r');
    while (!feof($file)) {
        $row = fgetcsv($file, 0, $delimeter);
        if ($row == [NULL] || $row === FALSE) { continue; }
        if (!$header) {
            $header = $row;
        } else {
            $data[] = array_combine($header, $row);
            $row_count++;
        }
    }
    fclose($file);

    return ['data' => $data, 'row_count' => $row_count];
}

function reset_parse_csv() {
    $header = false;
    $row_count = 0;
    $data = [];
}
Iterating over a large dataset (file lines, etc.) and pushing every item into an array increases memory usage, directly in proportion to the number of items handled.
So the bigger the file, the bigger the memory usage - in this case.
If you want a function that formats the CSV data before processing it, building it on top of generators sounds like a great idea.
Reading the PHP docs, it fits your case very well (emphasis mine):
A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.
Something like this:
function csv_read($filename, $delimeter=',')
{
    $header = [];
    $row = 0;

    # tip: don't do that every time you call csv_read(); pass the handle as a param instead ;)
    $handle = fopen($filename, "r");
    if ($handle === false) {
        return false;
    }

    while (($data = fgetcsv($handle, 0, $delimeter)) !== false) {
        if (0 == $row) {
            $header = $data;
        } else {
            # on demand usage
            yield array_combine($header, $data);
        }
        $row++;
    }

    fclose($handle);
}
And then:
$generator = csv_read('rdu-weather-history.csv', ';');
foreach ($generator as $item) {
    do_something($item);
}
The major difference here is: you do not load and consume all the data at once. You get items on demand (like a stream) and process them one at a time. That has a huge impact on memory usage.
P.S.: The CSV file above was taken from: https://data.townofcary.org/api/v2/catalog/datasets/rdu-weather-history/exports/csv
It is not necessary to write a generator function. The SplFileObject also works fine.
$fileObj = new SplFileObject($file);

$fileObj->setFlags(SplFileObject::READ_CSV
    | SplFileObject::SKIP_EMPTY
    | SplFileObject::READ_AHEAD
    | SplFileObject::DROP_NEW_LINE
);
$fileObj->setCsvControl(';');

foreach ($fileObj as $row) {
    //do something
}
I tried that with the file "rdu-weather-history.csv" (> 500KB). memory_get_peak_usage() returned the value 424k after the foreach loop. The values must be processed line by line.
If a 2-dimensional array is created, the storage space required for this example increases to more than 8 MB.
One thing you could possibly attempt is a bulk import into MySQL, which may give you a better platform to work from once it's imported:
LOAD DATA INFILE '/home/user/data.csv' INTO TABLE CSVImport;
where the CSVImport columns match your CSV. It's a bit of a left-field suggestion, but depending on your use case it can be a better way to parse massive datasets.
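To sketch how that could be driven from PHP (assumptions, not from the answer above: the PDO MySQL driver, local_infile enabled on the server, and a pre-created CSVImport table):

<?php
// Sketch only: credentials and table layout are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // required for LOAD DATA LOCAL
]);

$sql = <<<SQL
LOAD DATA LOCAL INFILE '/home/user/data.csv'
INTO TABLE CSVImport
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
SQL;

$pdo->exec($sql); // returns the number of imported (affected) rows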

How to remove unwanted characters/text - php

When I retrieve a file, it outputs:
...22nlarray(3) { [0]=> string(62) "/public_html/wp/wp-content/plugins/AbonneerProgrammas/Albums/." [1]=> string(63) "/public_html/wp/wp-content/plugins/AbonneerProgrammas/Albums/.." [2]=> string(69) "/public_html/wp/wp-content/plugins/AbonneerProgrammas/Albums/22nl.mp3" }
I, however, only want the 22nl displayed. I do not want the rest over there. How can I do that? Is there a function that deletes the rest of the output except 22nl (which is a filename)?
My PHP code:
// get contents of the current directory
$contents = ftp_nlist($conn_id, $destination_folder);
foreach ($contents as $mp3_url) {
$filename = basename($mp3_url, ".mp3");
echo "<a href='$mp3_url'>$filename</a>";
}
var_dump($contents);
There are similar questions to mine; however, they did not provide a good answer for me.
Greetings,
Rezoo Aftib
We can say that your value from the array will be
$value = 'string(69) "/public_html/wp/wp-content/plugins/AbonneerProgrammas/Albums/22nl.mp3"';
So you can extract the file name like this:
$value = explode('"', $value);
$exploded = explode('/', $value[1]);
$full_mp3_name = end($exploded);
$just_name = explode(".", $full_mp3_name);
$just_name = $just_name[0];
When you print $full_mp3_name you will get 22nl.mp3.
When you print $just_name you will get 22nl.
It would be cleaner wrapped in a function, but this example should work if it's an option for you.
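As an aside (a simpler alternative, not part of the answer above): if you already have the plain path string, as the question's own loop does, pathinfo() gets the name in one step:

$just_name = pathinfo('/public_html/wp/wp-content/plugins/AbonneerProgrammas/Albums/22nl.mp3', PATHINFO_FILENAME);
echo $just_name; // 22nl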

PHP CSV-Upload UTF-8 (with and without BOM)

Can someone perhaps explain the difference to me - and how to recognize or change the format?
I have a simple HTML upload form, and after uploading I parse the file contents with fgetcsv(). After parsing I have an array like this:
array(2) {
[0]=>
array(9) {
["OrderId"]=>
string(13) "FG-456887"
["Product"]=>
string(7) "B9876"
}
[1]=>
array(9) {
["OrderId"]=>
string(13) "FG-852562"
["Product"]=>
string(7) "B9877"
}
}
var_dump() shows me (apparently) exactly the same dump when using files with or without a BOM, but when I make a simple loop over this array and check if the OrderId (first field in the CSV) is empty, this always fails when the CSV is encoded without a BOM. When I save the same file with a BOM, everything works fine.
foreach ($data as $position) {
    $orderid = $position["OrderId"];
    if (empty($orderid)) die('No orderid found');
}
And it is only the first field - the other fields are ok.
Found it myself. I don't know if it's elegant, but it works...
function remove_utf8_bom($text) {
    $bom = pack('H*', 'EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}

function csv_to_array($filename='', $delimiter=';', $seperator = '"') {
    if (!file_exists($filename) || !is_readable($filename))
        return FALSE;

    $csvdata = file($filename);
    $header = NULL;
    $data = array();

    foreach ($csvdata as $line) {
        $row = remove_utf8_bom($line);
        $row = str_getcsv($row, $delimiter, $seperator);
        if (!$header)
            $header = $row;
        else
            $data[] = array_combine($header, $row);
    }
    return $data;
}
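A quick usage sketch (the uploaded-file source is my assumption):

$data = csv_to_array($_FILES['upload']['tmp_name'], ';');
foreach ($data as $row) {
    echo $row['OrderId'], "\n"; // the key now matches, with or without a BOM
}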
Background:
Unbeknownst to me, I was in the same situation. I only realized it when I could not use the data that I had imported from CSV files.
Problem:
While importing two columns from a CSV file, I could not access the data in the first column of the array:
array() => ['project_nr' => '0000000', 'project_name']
I tried:
array_keys($myArray);
and it worked as expected, but not until further analysis did I see that the first key 'project_nr' was 13 characters long and not 10. Which, I later realized, was the BOM being read in.
Solution:
$str = file_get_contents('yourfile.utf8.csv');
$bom = pack("CCC", 0xef, 0xbb, 0xbf);
if (0 === strncmp($str, $bom, 3)) {
    echo "BOM detected - file is UTF-8\n";
    $str = substr($str, 3);
}
Reference:
Here is where I found the solution
Anecdote:
I placed this solution here in the hope of connecting Google searches for "not being able to access specific keys in an array" to BOM UTF-8 CSV uploads (which is what I needed and was not able to find). I hope it may be of help to some desperately searching soul.

Variable contents from a file disappear and the loop doesn't enter

I have the following code to read from a file, and write back to it after some computation.
if (file_exists(CACHE_FILE_PATH)) {
    //read the cache and delete that line!
    $inp = array();
    $cache = fopen(CACHE_FILE_PATH, 'r');
    if ($cache) {
        while (!feof($cache)) {
            $tmp = fgets($cache);
            //some logic with $tmp
            $inp[] = $tmp;
        }
        fclose($cache);
    }
    var_dump($inp);
    $cache = fopen(CACHE_FILE_PATH, 'w');
    var_dump($inp);
    if ($cache) {
        var_dump($inp);
        foreach ($inp as $val) {
            echo "\nIN THE LOOP";
            fwrite($val."\n");
        }
        fclose($cache);
    }
}
The output of the var_dumps is:
array(3) {
[0]=>
string(13) "bedupako|714
"
[1]=>
string(16) "newBedupako|624
"
[2]=>
string(19) "radioExtension|128
"
}
array(3) {
[0]=>
string(13) "bedupako|714
"
[1]=>
string(16) "newBedupako|624
"
[2]=>
string(19) "radioExtension|128
"
}
array(3) {
[0]=>
string(13) "bedupako|714
"
[1]=>
string(16) "newBedupako|624
"
[2]=>
string(19) "radioExtension|128
"
}
Even though it's an array, it is not going into the loop and printing IN THE LOOP! Why?
This part of your code:
fwrite($val."\n");
Should be:
fwrite($cache, $val); // the "\n" is only required if it was stripped off after fgets()
The first argument to fwrite() must be a file descriptor opened with fopen().
Of course, if you had turned on error_reporting(-1) and ini_set('display_errors', 'On') during development you would have spotted this immediately :)
As suggested in the comments, you should try to simplify your code by using constructs like file() to read the whole file into an array of lines and then use join() and file_put_contents() to write the whole thing back.
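A sketch of that simplification (the per-line logic is a placeholder):

// read all lines at once; FILE_IGNORE_NEW_LINES strips the trailing "\n"
$inp = file(CACHE_FILE_PATH, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($inp as $i => $line) {
    // some logic with $line ...
    $inp[$i] = $line;
}

// write everything back in one call
file_put_contents(CACHE_FILE_PATH, join("\n", $inp) . "\n");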
If you just want a cache of key/value pairs, you could look into something like this:
// to read, assuming the cache file exists
$cache = include CACHE_FILE_PATH;
// write back cache
file_put_contents(CACHE_FILE_PATH, '<?php return ' . var_export($cache, true) . ';');
It reads and writes files containing data structures that PHP itself can read (a lot faster than you can).

$file->eof() always returning false when using PHP's SplFileObject in 'r' mode

Why is my PHP script hanging?
$path = tempnam(sys_get_temp_dir(), '').'.txt';
$fileInfo = new \SplFileInfo($path);
$fileObject = $fileInfo->openFile('a');
$fileObject->fwrite("test line\n");
$fileObject2 = $fileInfo->openFile('r');
var_dump(file_exists($path)); // bool(true)
var_dump(file_get_contents($path)); // string(10) "test line
// "
var_dump(iterator_count($fileObject2)); // Hangs on this
If I delete the last line (iterator_count(...)) and replace it with this:
$i = 0;
$fileObject2->rewind();
while (!$fileObject2->eof()) {
    var_dump($fileObject2->eof());
    var_dump($i++);
    $fileObject2->next();
}
// Output:
// bool(false)
// int(0)
// bool(false)
// int(1)
// bool(false)
// int(2)
// bool(false)
// int(3)
// bool(false)
// int(4)
// ...
$fileObject2->eof() always returns false, so I get an infinite loop.
Why are these things happening? I need to get a line count.
Why are these things happening?
You are experiencing a peculiarity in the way that the SplFileObject class is written. Unless you call both the next() and current() methods (using the default (0) flags), the iterator never moves forward.
The iterator_count() function never calls current(); it checks valid() and calls next() only. Your bespoke loops only call one or the other of current() and next().
This should be considered a bug (whether in PHP itself, or a failure in the documentation), and the following code should work as expected (but does not). I invite you to report this misbehaviour.
// NOTE: This currently runs in an infinite loop!
$file = new SplFileObject(__FILE__);
var_dump(iterator_count($file));
Workarounds
One quick sidestep to get things moving is to set the READ_AHEAD flag on the object. This will cause the next() method to read the next available line.
$file->setFlags(SplFileObject::READ_AHEAD);
If, for any reason, you do not want the read-ahead behaviour, then you must call both next() and current() yourself.
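A sketch of a counting loop done that way (without READ_AHEAD; assumes test.txt exists):

$file = new SplFileObject('test.txt', 'r');
$count = 0;
while (!$file->eof()) {
    $file->current(); // actually reads the line
    $file->next();    // advances the line counter
    $count++;
}
var_dump($count);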
Back to the original problem of two SplFileObjects
The following should now work as you expected, allowing appending to a file and reading its line count.
<?php
$info = new SplFileInfo(__FILE__);
$write = $info->openFile('a');
$write->fwrite("// new line\n");
$read = $info->openFile('r');
$read->setFlags(SplFileObject::READ_AHEAD);
var_dump(iterator_count($read));
EDITED 01
If what you need is the number of lines inside the file:
<?php
$path = tempnam(sys_get_temp_dir(), '').'.txt';
$fileInfo = new SplFileInfo($path);
$fileObject = $fileInfo->openFile('a+');
$fileObject->fwrite("Foo".PHP_EOL);
$fileObject->fwrite("Bar".PHP_EOL);
echo count(file($path)); // outputs 2
?>
EDITED 02
This is your code from above, but without entering an infinite loop due to the file pointer:
<?php
$path = tempnam(sys_get_temp_dir(), '').'.txt';
$fileInfo = new SplFileInfo($path);
$fileObject = $fileInfo->openFile('a+');
$fileObject->fwrite("Foo".PHP_EOL);
$fileObject->fwrite("Bar");

foreach ($fileObject as $line_num => $line) {
    echo 'Line: '.$line_num.' "'.$line.'"'."<br/>";
}

echo 'Total Lines:' . $fileObject->key();
?>
Outputs
Line: 0 "Foo "
Line: 1 "Bar"
Total Lines:2
ORIGINAL ANSWER
The logic applied was a bit off. I simplified the code:
<?php
// set path to tmp with random file name
echo $path = tempnam(sys_get_temp_dir(), '').'.txt';
echo "<br/>";

// new object
$fileInfo = new \SplFileInfo($path);

// open to write
$fileObject = $fileInfo->openFile('a');
// write two lines
$fileObject->fwrite("Foo".PHP_EOL);
$fileObject->fwrite("Bar".PHP_EOL);

// open to read
$fileObject2 = $fileInfo->openFile('r');

// output contents
echo "File Exists: " . file_exists($path);
echo "<br/>";
echo "File Contents: " . file_get_contents($path);
echo "<br/>";

// foreach line get line number and line contents
foreach ($fileObject2 as $line_num => $line) {
    echo 'Line: '.$line_num;
    echo ' With: "'.$line.'" is the end? '.($fileObject2->eof() ? 'yes' : 'no')."<br>";
}
?>
Outputs:
/tmp/EAdklY.txt
File Exists: 1
File Contents: Foo Bar
Line: 0 With: "Foo " is the end? no
Line: 1 With: "Bar " is the end? no
Line: 2 With: "" is the end? yes
While it may seem counterintuitive, with PHP 5.3.9, this:
<?php
$f = new SplFileObject('test.txt', 'r');
while (!$f->eof()) {
    $f->next();
}
will be an infinite loop and never exit.
The following will exit when the end of the file is reached:
<?php
$f = new SplFileObject('test.txt', 'r');
while (!$f->eof()) {
    $f->current();
}
So:
$i = 0;
$fileObject2->rewind();
while (!$fileObject2->eof()) {
    var_dump($fileObject2->eof());
    var_dump($i++);
    $fileObject2->next();
}
should be rewritten as:
$fileObject2->rewind();
while (!$fileObject2->eof()) {
    $fileObject2->current();
}
$i = $fileObject2->key();
