The problem I'm facing is that I have an SplFileObject that reads a CSV file, which I then wrap in a LimitIterator.
The CSV file has 2 lines: a header and 1 data row. I set the LimitIterator offset to 1 and I can't get past it; it seems there is some internal problem with the indexing.
Here is the code:
$csvFile = new \SplFileObject($fileName);
$csvFile->setFlags(\SplFileObject::READ_CSV);
$iterator = new \LimitIterator($csvFile, 1);
/** @var array $it */
foreach ($iterator as $it) {
    list($colA, $colB, $colC) = $it; // placeholder names
    $result[] = [
        // ...
    ];
}
return $result;
The code works well if the CSV has 3 lines (a header and 2 rows), but with only a header and 1 row it doesn't work.
Any idea?
Thanks.
This is related to PHP bug 65601, which suggests adding the READ_AHEAD flag will fix this. I tested it, and it works as you expected.
$csvFile->setFlags(\SplFileObject::READ_CSV | \SplFileObject::READ_AHEAD);
See also SplFileObject + LimitIterator + offset
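For reference, a minimal sketch of the fixed version put together (the column handling is elided, as in the question):

$csvFile = new \SplFileObject($fileName);
// READ_AHEAD is the key addition for the LimitIterator offset problem
$csvFile->setFlags(\SplFileObject::READ_CSV | \SplFileObject::READ_AHEAD);
$result = [];
foreach (new \LimitIterator($csvFile, 1) as $it) {
    // $it is the parsed row (an array of columns),
    // even when only one data row follows the header
    $result[] = $it;
}
return $result;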
Personally, I would use the fopen and fgetcsv functions for this, as below:
<?php
$handle = fopen("myFile.csv", "r");
if ($handle !== false)
{
    $header_row_handled = false;
    while (($datarow = fgetcsv($handle, 0, ",")) !== false)
    {
        if (!$header_row_handled)
        {
            // Skip the header row
            $header_row_handled = true;
        }
        else
        {
            /* Whatever you need */
            $dataparta = $datarow[0];
            $datapartb = $datarow[1];
            $datapartc = $datarow[2];
        }
    }
    // Close the handle when done
    fclose($handle);
}
Looping means you can just skip the first row each time, and it perhaps uses fewer resources, as you don't have as many objects floating around.
The php.net page should help you out too
Related
I have a script which parses a CSV file and starts verifying the emails. This works fine for 1,000 lines, but on 15 million lines it shows a memory exhausted error. The file size is 400MB. Any suggestions on how to parse and verify them?
Server specs: Core i7 with 32GB of RAM
function parse_csv($file_name, $delimeter=',') {
    $header = false;
    $row_count = 0;
    $data = [];

    // clear any previous results
    reset_parse_csv();

    // parse
    $file = fopen($file_name, 'r');
    while (!feof($file)) {
        $row = fgetcsv($file, 0, $delimeter);
        if ($row == [NULL] || $row === FALSE) { continue; }
        if (!$header) {
            $header = $row;
        } else {
            $data[] = array_combine($header, $row);
            $row_count++;
        }
    }
    fclose($file);

    return ['data' => $data, 'row_count' => $row_count];
}

function reset_parse_csv() {
    $header = false;
    $row_count = 0;
    $data = [];
}
Iterating over a large dataset (file lines, etc.) and pushing the items into an array increases memory usage, and this increase is directly proportional to the number of items handled.
So the bigger the file, the bigger the memory usage, in this case.
If a function for formatting the CSV data before processing it is desired, building it on top of generators sounds like a great idea.
Reading the PHP doc, it fits your case very well (emphasis mine):
A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.
Something like this:
function csv_read($filename, $delimeter=',')
{
    $header = [];
    $row = 0;
    # tip: don't do this on every call to csv_read(); pass the handle in as a param instead ;)
    $handle = fopen($filename, "r");

    if ($handle === false) {
        return false;
    }

    while (($data = fgetcsv($handle, 0, $delimeter)) !== false) {
        if (0 == $row) {
            $header = $data;
        } else {
            # on-demand usage
            yield array_combine($header, $data);
        }
        $row++;
    }

    fclose($handle);
}
And then:
$generator = csv_read('rdu-weather-history.csv', ';');
foreach ($generator as $item) {
    do_something($item);
}
The major difference here is:
you do not fetch (from memory) and consume all the data at once. You get items on demand (like a stream) and process them one at a time instead. That has a huge impact on memory usage.
P.S.: The CSV file above was taken from: https://data.townofcary.org/api/v2/catalog/datasets/rdu-weather-history/exports/csv
It is not necessary to write a generator function. The SplFileObject also works fine.
$fileObj = new SplFileObject($file);
$fileObj->setFlags(SplFileObject::READ_CSV
| SplFileObject::SKIP_EMPTY
| SplFileObject::READ_AHEAD
| SplFileObject::DROP_NEW_LINE
);
$fileObj->setCsvControl(';');
foreach ($fileObj as $row) {
    // do something
}
I tried that with the file "rdu-weather-history.csv" (> 500KB). memory_get_peak_usage() returned the value 424k after the foreach loop. The values must be processed line by line.
If a 2-dimensional array is created instead, the storage space required for the example increases to more than 8 MB.
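To reproduce that measurement, a minimal sketch (file name and delimiter as above; the loop body is left empty on purpose):

$fileObj = new SplFileObject('rdu-weather-history.csv');
$fileObj->setFlags(SplFileObject::READ_CSV
    | SplFileObject::SKIP_EMPTY
    | SplFileObject::READ_AHEAD
    | SplFileObject::DROP_NEW_LINE
);
$fileObj->setCsvControl(';');
foreach ($fileObj as $row) {
    // process $row here; nothing is accumulated
}
// stays in the hundreds of kilobytes because no array is built
echo memory_get_peak_usage() . "\n";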
One thing you could possibly attempt is a bulk import into MySQL, which may give you a better platform to work from once it's imported.
LOAD DATA INFILE '/home/user/data.csv' INTO TABLE CSVImport; where the CSVImport columns match your CSV.
A bit of a left-field suggestion, but depending on what your use case is, it can be a better way to parse massive datasets.
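A sketch of issuing that import from PHP via PDO; the credentials and the CSV layout are assumptions, and LOAD DATA LOCAL INFILE must be permitted on both client and server:

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // needed for the LOCAL variant
]);
$pdo->exec("
    LOAD DATA LOCAL INFILE '/home/user/data.csv'
    INTO TABLE CSVImport
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
");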
I'm having trouble when trying to use array_combine in a foreach loop. It ends up with an error:
PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 85 bytes) in
Here is my code:
$data = array();
$csvData = $this->getData($file);
if ($columnNames) {
    $columns = array_shift($csvData);
    foreach ($csvData as $keyIndex => $rowData) {
        $data[$keyIndex] = array_combine($columns, array_values($rowData));
    }
}
return $data;
The source CSV file which I've used has approx ~1,000,000 rows. At this row:
$csvData = $this->getData($file)
I was using a while loop to read the CSV and assign it into an array; it works without any problem. The trouble comes from array_combine and the foreach loop.
Do you have any idea how to resolve this, or is there simply a better solution?
UPDATED
Here is the code to read the CSV file (using a while loop):
$data = array();
if (!file_exists($file)) {
    throw new Exception('File "' . $file . '" does not exist');
}
$fh = fopen($file, 'r');
while ($rowData = fgetcsv($fh, $this->_lineLength, $this->_delimiter, $this->_enclosure)) {
    $data[] = $rowData;
}
fclose($fh);
return $data;
UPDATED 2
The code above works without any problem if you are playing with a CSV file of roughly 20,000~30,000 rows or fewer. From 50,000 rows and up, the memory gets exhausted.
You're in fact keeping (or trying to keep) two distinct copies of the whole dataset in memory. First you load the whole CSV data into memory using getData(), and then you copy the data into the $data array by looping over the data in memory and creating a new array.
You should use stream-based reading when loading the CSV data, to keep just one copy of the data in memory. If you're on PHP 5.5+ (which you definitely should be, by the way), this is as simple as changing your getData method to look like this:
protected function getData($file) {
    if (!file_exists($file)) {
        throw new Exception('File "' . $file . '" does not exist');
    }
    $fh = fopen($file, 'r');
    while ($rowData = fgetcsv($fh, $this->_lineLength, $this->_delimiter, $this->_enclosure)) {
        yield $rowData;
    }
    fclose($fh);
}
This makes use of a so-called generator, which is a PHP >= 5.5 feature. The rest of your code should continue to work, as the inner workings of getData should be transparent to the calling code (though that is only half of the truth).
UPDATE to explain how extracting the column headers will work now.
$data = array();
$csvData = $this->getData($file);
if ($columnNames) { // don't know what this one does exactly
    $columns = null;
    foreach ($csvData as $keyIndex => $rowData) {
        if ($keyIndex === 0) {
            $columns = $rowData;
        } else {
            $data[$keyIndex /* -1 if you need 0-index */] = array_combine(
                $columns,
                array_values($rowData)
            );
        }
    }
}
return $data;
I have several files to parse (with PHP) in order to insert their respective contents into different database tables.
First point: the client gave me 6 files; 5 are CSV with values separated by commas, and the last one does not come from the same database and its content is tab-based.
I built a FileParser that uses SplFileObject to execute a method on each line of the file content (basically, create an Entity from each dataset and persist it to the database, with Symfony2 and Doctrine2).
But I cannot manage to parse the tab-based text file with SplFileObject; it does not split the content into lines as I expect it to...
// In my controller context
$parser = new MyAmazingFileParser();
$parser->parse($filename, $delimitor, function ($data) use ($em) {
    $e = new Entity();
    $e->setSomething($data[0]);
    // [...]
    $em->persist($e);
});
// In my parser
public function parse($filename, $delimitor = ',', $run = null) {
    if (is_callable($run)) {
        $handle = new SplFileObject($filename);
        $infos = new SplFileInfo($filename);
        if ($infos->getExtension() === 'csv') {
            // Everything is going well here
            $handle->setCsvControl(',');
            $handle->setFlags(SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY | SplFileObject::READ_CSV);
            foreach (new LimitIterator($handle, 1) as $data) {
                $result = $run($data);
            }
        } else {
            // Why does the Iterator way not work?
            $handle->setCsvControl("\t");
            // I have tried all the possible flag combinations, without success...
            foreach (new LimitIterator($handle, 1) as $data) {
                // It always only gets the first line...
                $result = $run($data);
            }

            // And the old, memory-killing, dirty way works?
            $fd = fopen($filename, 'r');
            $contents = fread($fd, filesize($filename));
            foreach (explode("\t", $contents) as $line) {
                // Gets all the lines as I want... but it's dirty and memory-expensive!
                $result = $run($line);
            }
        }
    }
}
It is probably related to the horrible formatting of my client's file, but after a long discussion with them, they really cannot give me another format, for some acceptable reasons (constraints on their side), unfortunately.
The file is currently 49,459 lines long, so I really think memory matters at this step; so I have to make the SplFileObject way work, but I do not know how.
An extract of the file can be found here :
Data-extract-hosted
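If the tab really is the record separator (which the working explode("\t") fallback suggests), a less memory-hungry sketch could pull one tab-delimited record at a time with stream_get_line(); the 65536 maximum record length is an assumption:

$fd = fopen($filename, 'r');
while (($line = stream_get_line($fd, 65536, "\t")) !== false) {
    // one record at a time, without loading the whole file into memory
    $result = $run($line);
}
fclose($fd);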
I wanted to find the best way to build a translation framework (gettext has some imperfections).
So I made two tests. First, parsing a file containing static text with the code below:
function parseLine($line) {
    // check for an empty line before indexing into it
    if (!strlen($line) || $line[0] == '#')
        return array();
    $eq = strpos($line, '=');
    $key = trim(substr($line, 0, $eq));
    $value = trim(substr($line, $eq + 1));
    $value = trim($value, '"');
    return array($key => $value);
}

$table = array();
$fp = fopen('lang.lng', 'r');
while (!feof($fp)) {
    $table += parseLine(fgets($fp, 4096));
}
fclose($fp);
and secondly, including an array:
$table = include('lang.php');
Of course lang.lng and lang.php contain the same data (1000 records), just represented in different ways.
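For illustration, the two representations might look like this (hypothetical entries):

# lang.lng - parsed line by line
greeting = "Hello"
farewell = "Goodbye"

<?php
// lang.php - returns the array directly
return array(
    'greeting' => 'Hello',
    'farewell' => 'Goodbye',
);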
I was surprised when I saw the results...
First method: ~0.01 s
Second: ~0.001 s
Before the test I was sure that including the array would take more memory and time than parsing the file.
Could somebody explain to me where the mistake is?
I would have thought that was a no-brainer. Which is faster: reading in a file that contains an array, or reading in a file, processing it line by line, and for each line looking for various tokens and components and piecing them all together in a complex manner to produce the same array as the include method?
I have PHP code that reads and parses CSV files into a multiline array; what I need to do next is take this array and let simplehtmldom fire off a crawler to return some company stock info.
The PHP code for the CSV parser is:
$arrCSV = array();
// Opening up the CSV file
if (($handle = fopen("NASDAQ.csv", "r")) !== FALSE) {
    // Set the parent array key to 0
    $key = 0;
    // While there is data available, loop through unlimited times (0) using separator (,)
    while (($data = fgetcsv($handle, 0, ",")) !== FALSE) {
        // Count the total keys in each row; $data is the variable for each line of the array
        $c = count($data);
        // Populate the array
        for ($x = 0; $x < $c; $x++) {
            $arrCSV[$key][$x] = $data[$x];
        }
        $key++;
    } // end while
    // Close the CSV file
    fclose($handle);
} // end if

echo "<pre>";
print_r($arrCSV);
echo "</pre>";
This works great and parses the array line by line, $data being the variable for each line. What I need to do now is get this read via simplehtmldom, which is where it breaks down. I'm looking at using the code below or something very similar. I'm pretty inexperienced at this, but I guess I would be needing a foreach statement somewhere along the line.
This is the simplehtmldom code:
$html = file_get_html($data);
// capture the result of find() so it can be used below
$es = $html->find('div[class="detailsDataContainerLt"]');
$tickerdetails = $es[0];
$FileHandle2 = fopen($data, 'w') or die("can't open file");
fwrite($FileHandle2, $tickerdetails);
fclose($FileHandle2);
fclose($handle);
So my question is: how can I get them both working together? I have checked out the simplehtmldom manual page several times and find it a little bit vague in this area. The simplehtmldom code above is what I use in another function, but by directly linking, so I know that it works.
Regards,
Martin
Your loop could be reduced to (yes, it's the same):
while ($data = fgetcsv($handle, 0, ',')) {
    $arrCSV[] = $data;
}
Using SimpleXML instead of SimpleDom (since it's standard PHP):
foreach ($arrCSV as $row) {
    $xml = simplexml_load_file($row[0]); // Change 0 to the index of the url
    $result = $xml->xpath('//div[contains(concat(" ", @class, " "), " detailsDataContainerLt")]');
    if (count($result) > 0) {
        $file = fopen($row[1], 'w'); // Change 1 to the filename you want to write to
        if ($file) {
            fwrite($file, (string) $result[0]);
            fclose($file);
        }
    }
}
That should do it, if I understood correctly...
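One caveat with the SimpleXML sketch: simplexml_load_file() expects well-formed XML, and most live HTML pages are not. A more forgiving variant (same XPath, same assumed CSV columns) uses DOMDocument, which tolerates real-world markup:

foreach ($arrCSV as $row) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings from messy HTML
    if (!$dom->loadHTMLFile($row[0])) { // change 0 to the index of the url
        continue;
    }
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//div[contains(concat(" ", @class, " "), " detailsDataContainerLt")]');
    if ($nodes->length > 0) {
        file_put_contents($row[1], $dom->saveHTML($nodes->item(0))); // change 1 to the target filename
    }
}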