Parsing a large CSV in PHP

I have a large CSV that I am importing into an existing PHP application via a PHP CLI script, using the existing application's ORM to manage the relationships between fields (trust me, I looked at a SQL CSV import; it's easier this way). The issue at hand is that either the CSV data I have is malformed, or my fgetcsv() call is wrong.
This is an example data set I'm aiming to import:
id,fname,lname,email\n
1,John,Public,johnqpublic@mailinator.com\n
1,Jane,Public,janeqpublic@mailinator.com\n
The CSV import code is pretty much taken from the PHP docs on fgetcsv():
function import_users($filepath) {
    $row = 0;
    $linesExecuted = 0;

    if (($file = fopen($filepath, 'r')) !== false) {
        $header = fgetcsv($file); // Loads line 1
        while (($data = fgetcsv($file, 0, ",")) !== false) {
            $userObj = user_record_generate_stdclass($data);
            // A future method actually pipes the data through via the ORM
            $row++;
        }
        fclose($file);
    } else {
        echo "It's going horribly wrong";
    }

    echo $row." records imported.";
}
The resulting output of this method is pretty much just a % sign, which is baffling. Am I overlooking something?

Related

Reading > 1GB GZipped CSV files from external FTP server

In a scheduled task of my Laravel application I'm reading several large gzipped CSV files, ranging from 80 MB to 4 GB, from an external FTP server. They contain products which I store in my database based on a product attribute.
I loop through a list of product feeds that I want to import, but each time a fatal error is returned: 'Allowed memory size of 536870912 bytes exhausted'. I can bump up the length parameter of the fgetcsv() function from 1000 to 100000, which solves the problem for the smaller files (< 500 MB), but for the larger files it still returns the fatal error.
Is there a solution that allows me to either download or unzip the .csv.gz files, read the lines (by batch or one by one) and insert the products into my database without running out of memory?
$feeds = [
    "feed_baby-mother-child.csv.gz",
    "feed_computer-games.csv.gz",
    "feed_general-books.csv.gz",
    "feed_toys.csv.gz",
];

foreach ($feeds as $feed) {
    $importedProducts = array();
    $importedFeedProducts = 0;

    $csvfile = 'compress.zlib://ftp://' . config('app.ftp_username') . ':' . config('app.ftp_password') . '@' . config('app.ftp_host') . '/' . $feed;

    if (($handle = fopen($csvfile, "r")) !== FALSE) {
        $row = 1;
        $header = fgetcsv($handle, 1, "|");

        while (($data = fgetcsv($handle, 1000, "|")) !== FALSE) {
            if ($row == 1 || array(null) !== $data) { $row++; continue; }

            $product = array_combine($header, $data);
            $importedProducts[] = $product;
        }

        fclose($handle);
    } else {
        echo 'Failed to open: ' . $feed . PHP_EOL;
        continue;
    }

    // start inserting products into the database below here
}
The problem is probably not the gzip file itself. Of course you can download it and process it locally, but that keeps the same issue, because you are loading all products into a single array (in memory):
$importedProducts[] = $product;
You could comment this line out and see if that prevents you from hitting your memory limit.
Usually I would create a method like addProduct($product) to handle this in a memory-safe way.
From there you can decide on a maximum number of products to buffer before doing a bulk insert, to achieve optimal speed; I usually use something between 1000 and 5000 rows.
For example
class ProductBatchInserter
{
    private $maxRecords = 1000;
    private $records = [];

    function addProduct($record) {
        $this->records[] = $record;
        if (count($this->records) >= $this->maxRecords) {
            EloquentModel::insert($this->records);
            $this->records = [];
        }
    }
}
However, I usually don't implement it as a single class; in my projects I used to integrate it as a BulkInsertable trait that could be used on any Eloquent model.
But this should give you a direction for how you can avoid memory limits; a rough sketch of such a trait follows below.
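A minimal, hypothetical sketch of such a trait (the property names, threshold, and method names are illustrative, not taken from an existing package) could look like this:
trait BulkInsertable
{
    private static $bulkBuffer = [];
    private static $bulkMax = 1000; // flush threshold; tune between 1000 and 5000

    // Buffer one record and flush with a single bulk insert once the threshold is reached.
    public static function bulkAdd(array $record)
    {
        static::$bulkBuffer[] = $record;
        if (count(static::$bulkBuffer) >= static::$bulkMax) {
            static::bulkFlush();
        }
    }

    // Insert whatever remains in the buffer (call once after the import loop).
    public static function bulkFlush()
    {
        if (!empty(static::$bulkBuffer)) {
            static::insert(static::$bulkBuffer); // Eloquent's Model::insert() with an array of rows
            static::$bulkBuffer = [];
        }
    }
}
On a model such as Product, you would call Product::bulkAdd($product) inside the CSV loop and Product::bulkFlush() once after it.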
Or, the easier but significantly slower option: just insert each row where you currently assign it to the array. But that will put a ridiculous load on your database and will be very slow.
If the GZIP stream is the bottleneck
I don't expect this to be the issue, but if it were, you could use gzopen()
https://www.php.net/manual/en/function.gzopen.php
and pass the gzopen handle as the handle for fgetcsv().
But I expect the stream handler you are using is already doing this for you.
If not, I mean like this:
$input = gzopen('input.csv.gz', 'r');

while (($row = fgetcsv($input)) !== false) {
    // do something memory safe, like suggested above
}
If you need to download the file anyway, there are many ways to do it, but make sure you use something memory safe, like fopen()/fgets() or a Guzzle stream, and don't use something like file_get_contents(), which loads the whole file into memory.
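As a minimal sketch of such a memory-safe download, one option (not mentioned above) is stream_copy_to_stream(); the FTP URL and local path are placeholders, and this assumes allow_url_fopen is enabled for ftp://:
// Remote and local paths are placeholders.
$source = fopen('ftp://user:pass@ftp.example.com/feed_toys.csv.gz', 'rb');
$target = fopen('/tmp/feed_toys.csv.gz', 'wb');

if ($source !== false && $target !== false) {
    // stream_copy_to_stream() copies in chunks, so memory usage stays flat
    // no matter how large the remote file is.
    stream_copy_to_stream($source, $target);
    fclose($source);
    fclose($target);
}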

How to parse a CSV file that contains 15 million lines of data in PHP

I have a script which parses the CSV file and starts verifying the emails. This works fine for 1,000 lines, but on 15 million lines it shows a memory exhausted error. The file size is 400 MB. Any suggestions on how to parse and verify them?
Server specs: Core i7 with 32 GB of RAM
function parse_csv($file_name, $delimeter=',') {
    $header = false;
    $row_count = 0;
    $data = [];

    // clear any previous results
    reset_parse_csv();

    // parse
    $file = fopen($file_name, 'r');
    while (!feof($file)) {
        $row = fgetcsv($file, 0, $delimeter);
        if ($row == [NULL] || $row === FALSE) { continue; }
        if (!$header) {
            $header = $row;
        } else {
            $data[] = array_combine($header, $row);
            $row_count++;
        }
    }
    fclose($file);

    return ['data' => $data, 'row_count' => $row_count];
}

function reset_parse_csv() {
    $header = false;
    $row_count = 0;
    $data = [];
}
Iterating over a large dataset (file lines, etc.) and pushing every item into an array increases memory usage in direct proportion to the number of items handled.
So the bigger the file, the bigger the memory usage, in this case.
If you want a function that formats the CSV data before processing it, building it on top of generators sounds like a great idea.
Reading the PHP docs, this fits your case very well (emphasis mine):
A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.
Something like this:
function csv_read($filename, $delimeter=',')
{
    $header = [];
    $row = 0;

    // tip: don't do this every time csv_read() is called; pass the handle as a param instead ;)
    $handle = fopen($filename, "r");

    if ($handle === false) {
        return false;
    }

    while (($data = fgetcsv($handle, 0, $delimeter)) !== false) {
        if (0 == $row) {
            $header = $data;
        } else {
            // on demand usage
            yield array_combine($header, $data);
        }
        $row++;
    }

    fclose($handle);
}
And then:
$generator = csv_read('rdu-weather-history.csv', ';');

foreach ($generator as $item) {
    do_something($item);
}
The major difference here is: you do not fetch and consume all the data (in memory) at once. You get items on demand (like a stream) and process them one at a time instead. That has a huge impact on memory usage.
P.S.: The CSV file above was taken from: https://data.townofcary.org/api/v2/catalog/datasets/rdu-weather-history/exports/csv
It is not necessary to write a generator function. The SplFileObject also works fine.
$fileObj = new SplFileObject($file);
$fileObj->setFlags(SplFileObject::READ_CSV
    | SplFileObject::SKIP_EMPTY
    | SplFileObject::READ_AHEAD
    | SplFileObject::DROP_NEW_LINE
);
$fileObj->setCsvControl(';');

foreach ($fileObj as $row) {
    // do something
}
I tried that with the file "rdu-weather-history.csv" (> 500 KB). memory_get_peak_usage() returned the value 424k after the foreach loop. The values must be processed line by line.
If a 2-dimensional array is created, the memory required for the example increases to more than 8 MB.
One thing you could possibly attempt is a bulk import to MySQL, which may give you a better platform to work from once the data is imported.
LOAD DATA INFILE '/home/user/data.csv' INTO TABLE CSVImport; where the CSVImport columns match your CSV.
Bit of a left-field suggestion, but depending on what your use case is, it can be a better way to handle massive datasets.
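A minimal sketch of driving that import from PHP via PDO; the connection details and table name are placeholders, the IGNORE 1 LINES assumes a header row, and the LOCAL variant requires local_infile to be enabled on the MySQL server:
$pdo = new PDO('mysql:host=localhost;dbname=import_db;charset=utf8mb4', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // allow LOAD DATA LOCAL INFILE on the client side
]);

// Stream the CSV straight into the table; the first (header) line is skipped.
$pdo->exec("
    LOAD DATA LOCAL INFILE '/home/user/data.csv'
    INTO TABLE CSVImport
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
");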

Updating CSV Column Values Using UA-Parser Library

I'm using the ua-parser library to identify the device family for a number of user agent strings in a spreadsheet column. The problem I'm running into is that it doesn't seem like my function is really running. The value output for detectAgent($data[2]) is not always accurate.
Here's a code sample. I feel like I must be missing something related to the limitations of creating objects over and over again.
Thanks in advance for any help.
<?php
require_once 'vendor/autoload.php';

use UAParser\Parser;

function detectAgent($ua) {
    $parser = Parser::create();
    $result = $parser->parse($ua);

    return $result->os->family;
}

$input_file = "input.csv";
$output_file = "output.csv";

if (($handle1 = fopen($input_file, "r")) !== FALSE) {
    if (($handle2 = fopen($output_file, "w")) !== FALSE) {
        while (($data = fgetcsv($handle1, 5000000, ",")) !== FALSE) {
            // Alter your data
            #print $data . "<br />";
            $data[2] = detectAgent($data[2]); // identify browser family

            // Write back to CSV format
            fputcsv($handle2, $data);
        }
        fclose($handle2);
    }
    fclose($handle1);
}
?>
This was a silly mistake. I was writing to the wrong column in $data[2] = detectAgent($data[2]);.
If anyone else runs into the same problem, the code is working now and I've posted an example here.
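Since the linked example isn't included here, a hypothetical sketch of that kind of fix (the column index and the choice to append a new column rather than overwrite one are assumptions) could look like this inside the loop above:
while (($data = fgetcsv($handle1, 5000000, ",")) !== FALSE) {
    // Keep the original user agent column intact and append the detected
    // OS family as a new column instead of overwriting existing data.
    $data[] = detectAgent($data[2]);
    fputcsv($handle2, $data);
}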

Read .csv file and save its values in a list of arrays

I am new to PHP programming but I have been stuck with this code for some time.
I would like to read a .csv file line by line and then save its values in a list of arrays.
$file = fopen('Sub-Companies.csv', 'r');

while (($line = fgetcsv($file)) !== FALSE) {
    print_r($line);

    list($customer_id[], $company_name[], $department[], $employee[], $country[], $zipcode[], $address[], $city[],
        $smth1[], $smth2[], $phone_no1[], $phone_no2[], $email[], $website[],
        $customer_no[], $problem1[], $problem2[]) = explode(";", $line);
}

fclose($file);
var_dump($customer_id);
The problem is that, although the file is read correctly, the explode() is not working and the arrays appear to be null.
One thing that I am considering is that some rows have more ";" separators than others, so that might be a problem; that is why I have the arrays $problem1 and $problem2, in order to store the values of those extra fields.
Any help would be great!
You're using fgetcsv() in the wrong way.
We've come to this solution while chatting here on StackOverflow.
<?php
// Create file data.csv with your data
$handle = fopen('Sub-Companies.csv', 'r');

$customer_id = array();
$xyz_array = array();
// ...

// Better use a specified length (second parameter) instead of 0,
// as 0 slows down the whole process of reading the data!
while (($line = fgetcsv($handle, 0, ';')) !== FALSE) {
    $customer_id[] = $line[0];
    $xyz_array[] = $line[1];
}
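For completeness, mirroring the original snippet, you would then close the handle and can inspect one of the resulting arrays:
fclose($handle);
var_dump($customer_id);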

How to import CSV using Zend

How do I import CSV files using Zend Framework? Should I use Zend_File_Transfer or is there a special class that I have to look into? Also, if I use Zend_File_Transfer, is there a special validator for CSV?
You don't have to use any Zend libraries to import CSV files; you can just use native PHP functions. Take a look at fgetcsv():
$row = 1;
if (($handle = fopen("test.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $num = count($data);
        echo "<p> $num fields in line $row: <br /></p>\n";
        $row++;
        for ($c = 0; $c < $num; $c++) {
            echo $data[$c] . "<br />\n";
        }
    }
    fclose($handle);
}
You could also use SplFileObject for reading CSV files.
From the php manual:
<?php
$file = new SplFileObject("animals.csv");
$file->setFlags(SplFileObject::READ_CSV);

foreach ($file as $row) {
    list($animal, $class, $legs) = $row;
    printf("A %s is a %s with %d legs\n", $animal, $class, $legs);
}
?>
http://php.net/manual/en/splfileobject.fgetcsv.php
There is currently no way to do this with the Zend Framework. How can one be sure?
For example, Zend_Translate supports translation with CSV files, but if you check the source code of the respective adapter (Zend_Translate_Adapter_Csv), you can verify that it uses fgetcsv(), and not a specific Zend class. Besides, this CSV adapter comes with the following warning:
Note: Beware that the Csv Adapter has problems when your Csv files are encoded differently than the locale setting of your environment. This is due to a bug in PHP itself which will not be fixed before PHP 6.0 (http://bugs.php.net/bug.php?id=38471). So you should be aware that the Csv Adapter, due to PHP restrictions, is not locale aware.
which relates to the known problems of the fgetcsv() function.
Here's a function that reads a CSV file and returns an array of items containing the first two column values.
This function could read a file of first_name,last_name pairs, for example.
function processFile($filename) {
    $rtn = array();

    if (($handle = fopen($filename, "r")) !== FALSE) {
        while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
            $item = array();
            $item[] = $data[0];
            $item[] = $data[1];
            $rtn[] = $item;
        }
    }

    return $rtn;
}
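Usage would look something like this (the file name is just an example):
$names = processFile('first_last_names.csv');

foreach ($names as $name) {
    echo $name[0] . ' ' . $name[1] . PHP_EOL;
}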
