Reading csv file with large number of fields


I have a CSV file with 104 fields, but I need only 4 of them for a MySQL database. Each file has about a million rows.
Could somebody tell me an efficient way to do this? Reading each line into an array takes a long time.
Thanks

You have to read every line in its entirety by definition. This is necessary to find the delimiter for the next record (i.e. the newline character). You only need to discard the data you have read that you don't need. E.g.:
$data = array();
// Build the lookup of wanted column names once, outside the loop
$wanted = array_flip(array('foo', 'bar', 'baz'));
$fh = fopen('data.csv', 'r');
$headers = fgetcsv($fh);
while (($row = fgetcsv($fh)) !== false) {
    $row = array_combine($headers, $row);
    $data[] = array_intersect_key($row, $wanted);
    // alternatively, if you know the column index, something like:
    // $data[] = array($row[1], $row[45], $row[60]);
}
fclose($fh);
This only retains the columns foo, bar and baz and discards the rest. The reading from file (fgetcsv) is about as fast as it gets. If you need it any faster, you'll have to implement your own CSV tokenizer and parser which skips over the columns you don't need without even temporarily storing them in memory; how much of a performance boost this brings vs. development time necessary to implement this bug free is very debatable.
"A simple Excel macro can drop all unnecessary columns (100 out of 104) within a second. I am looking for a similar solution."
That is because Excel, once a file is opened, has all data in memory and can act on it very quickly. For an accurate comparison you need to compare the time it takes to open the file in Excel + dropping of the columns, not just dropping the columns.

Related

How to order a big CSV file with PHP?

I'm looking for an algorithm strategy. I have a CSV file with 162 columns and 55000 lines.
I want to sort the data by a date (which is in column 3).
First I tried to put everything directly into an array, but memory explodes.
So I decided to:
1/ Put the first 3 columns in an array.
2/ Sort this array with usort.
3/ Read the CSV file again to recover the other columns.
4/ Add the complete line to a new CSV file.
5/ Replace the line with an empty string in the CSV file being read.
// First read of the file
while (($data = fgetcsv($handle, 0, ';')) !== false)
{
    $tabLigne[$columnNames[0]] = $data[0];
    $tabLigne[$columnNames[1]] = $data[1];
    $tabLigne[$columnNames[2]] = $data[2];
    $dateCreation = DateTime::createFromFormat('d/m/Y', $tabLigne['Date de Création']);
    if ($dateCreation !== false)
    {
        $tableauDossiers[$row] = $tabLigne;
    }
    $row++;
    unset($data);
    unset($tabLigne);
}
// Order the array by date
usort(
    $tableauDossiers,
    function ($x, $y) {
        $date1 = DateTime::createFromFormat('d/m/Y', $x['Date de Création']);
        $date2 = DateTime::createFromFormat('d/m/Y', $y['Date de Création']);
        // usort() expects an integer (<0, 0, >0), not a boolean
        return $date1->format('U') <=> $date2->format('U');
    }
);
fclose($handle);
copy(PATH_CSV.'original_file.csv', PATH_CSV.'copy_of_file.csv');
for ($row = 3; $row <= count($tableauDossiers); $row++)
{
    $handle = fopen(PATH_CSV.'copy_of_file.csv', 'c+');
    $tabHandle = file(PATH_CSV.'copy_of_file.csv');
    fgetcsv($handle);
    fgetcsv($handle);
    $rowHandle = 2;
    while (($data = fgetcsv($handle, 0, ';')) !== false)
    {
        if ($tableauDossiers[$row]['Caisse Locale Déléguée'] == $data[0]
            && $tableauDossiers[$row]['Date de Création'] == $data[1]
            && $tableauDossiers[$row]['Numéro RCT'] == $data[2])
        {
            fputcsv($fichierSortieDossier, $data, ';');
            $tabHandle[$rowHandle] = str_replace("\n", '', $tabHandle[$rowHandle]);
            file_put_contents(PATH_CSV.'copy_of_file.csv', $tabHandle);
            unset($tabHandle);
            break;
        }
        $rowHandle++;
        unset($data);
        unset($tabLigne);
    }
    fclose($handle);
    unset($handle);
}
This algorithm works, but it takes far too long to execute.
Any idea how to improve it?
Thanks

Assuming you are limited to PHP and cannot use a database as suggested in the comments, the next best option is an external sorting algorithm.
1. Split the file into small files. Each file should be small enough to sort in memory.
2. Sort each of these files individually in memory.
3. Merge the sorted files into one big file by comparing the first lines of each file.
The merging of the sorted files can be done very memory-efficiently: you only need to have the first line of each file in memory at any given time. The first line with the minimal timestamp goes to the resulting file.
For really big files you can cascade the merging, i.e. if you have 10,000 files, you can merge groups of 100 files first and then merge the resulting 100 files.
Example
I use a comma to separate values instead of line-breaks for readability.
The unsorted file (imagine it to be too big to fit into memory):
1, 6, 2, 4, 5, 3
Split the files in parts that are small enough to fit into memory:
1, 6, 2
4, 5, 3
Sort them individually:
1, 2, 6
3, 4, 5
Now merge:
Compare 1 & 3 → take 1
Compare 2 & 3 → take 2
Compare 6 & 3 → take 3
Compare 6 & 4 → take 4
Compare 6 & 5 → take 5
Take 6.
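The split, sort and merge steps above can be sketched in PHP roughly as follows. This is a minimal illustration, not a tuned implementation: the function names, the chunk size and the d/m/Y date column are assumptions, and rows whose date cannot be parsed (a header line, for example) get the key 0 and therefore end up first.

```php
<?php
// Sketch of an external merge sort for a CSV file, ordered by a d/m/Y date column.
// All names, the chunk size and the column index are illustrative assumptions.

/** Turn a d/m/Y string into a sortable integer such as 20240131 (0 if malformed). */
function dateKey(string $date): int
{
    $p = explode('/', trim($date));
    return count($p) === 3 ? (int) sprintf('%04d%02d%02d', $p[2], $p[1], $p[0]) : 0;
}

/** Sort the buffered rows by date and write them to a temporary chunk file. */
function writeSortedChunk(array $rows, int $col, string $sep): string
{
    usort($rows, fn($a, $b) => dateKey($a[$col] ?? '') <=> dateKey($b[$col] ?? ''));
    $path = tempnam(sys_get_temp_dir(), 'chunk');
    $fh = fopen($path, 'w');
    foreach ($rows as $row) {
        fputcsv($fh, $row, $sep, '"', '\\');
    }
    fclose($fh);
    return $path;
}

function externalSortCsv(string $in, string $out, int $col, string $sep = ';', int $chunkSize = 5000): void
{
    // Steps 1 and 2: split the input into chunks and sort each chunk in memory.
    $chunks = [];
    $rows = [];
    $fh = fopen($in, 'r');
    while (($row = fgetcsv($fh, 0, $sep, '"', '\\')) !== false) {
        $rows[] = $row;
        if (count($rows) >= $chunkSize) {
            $chunks[] = writeSortedChunk($rows, $col, $sep);
            $rows = [];
        }
    }
    fclose($fh);
    if ($rows) {
        $chunks[] = writeSortedChunk($rows, $col, $sep);
    }

    // Step 3: merge, holding only the current first row of each chunk in memory.
    $handles = [];
    $heads = [];
    foreach ($chunks as $i => $path) {
        $handles[$i] = fopen($path, 'r');
        $heads[$i] = fgetcsv($handles[$i], 0, $sep, '"', '\\');
    }
    $outFh = fopen($out, 'w');
    while (true) {
        $min = null;
        foreach ($heads as $i => $head) {
            if ($head !== false
                && ($min === null || dateKey($head[$col] ?? '') < dateKey($heads[$min][$col] ?? ''))) {
                $min = $i;
            }
        }
        if ($min === null) {
            break; // every chunk is exhausted
        }
        fputcsv($outFh, $heads[$min], $sep, '"', '\\');
        $heads[$min] = fgetcsv($handles[$min], 0, $sep, '"', '\\');
    }
    fclose($outFh);
    foreach ($chunks as $i => $path) {
        fclose($handles[$i]);
        unlink($path);
    }
}
```

A call such as externalSortCsv('original_file.csv', 'sorted.csv', 2) would sort on the third column. With many chunks, the linear scan for the minimum could be replaced by a heap, and the merge could be cascaded as described above.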

You have a fairly large set of data to process, so you need to do something to optimize it.
You could increase your memory limit, but that only postpones the error: with a bigger file it will crash again (or get far too slow).
The first option is to minimize the amount of data. Remove all non-relevant columns from the file. Whichever solution you apply, a smaller dataset is always faster.
I suggest you put it into a database and apply your requirements to it, using that result to create a new file. A database is made to manage large data sets, so it'll take a whole lot less time.
Taking that much data and writing it to a file from PHP will still be slow, but could be manageable. Another tactic might be the command line, using a .sh file. If you have basic terminal/SSH skills, you have basic .sh writing capabilities. In that file, you can use mysqldump to export as CSV. Mysqldump will be significantly faster, but it's a bit trickier to get going when you're used to PHP.
To improve your current code:
- The unset() calls at the end of the first loop don't do anything useful. The variables barely hold any data and get reset anyway when the next iteration of the while starts.
- Instead of DateTime for everything, which is easier to work with but slower, use epoch values. I don't know what format the date comes in now, but if you use epoch seconds (like the result of time()), you have two numbers. Your usort() will improve drastically, as it no longer has to use the heavy DateTime class, just a simple number comparison.
This all assumes that you need to do it multiple times. If not, just open it in Excel or Numbers, use its sort and save as a copy.
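To illustrate the epoch suggestion from the list above, here is a small sketch that parses each date once and then lets usort() compare plain integers. The sample rows and the 'ts' key are made up for the example.

```php
<?php
// Sketch: parse each d/m/Y date once into an epoch timestamp, then sort on integers.
// The sample rows and the 'ts' key are invented for illustration.
$rows = [
    ['Date de Création' => '15/03/2021'],
    ['Date de Création' => '02/01/2020'],
    ['Date de Création' => '28/11/2020'],
];

foreach ($rows as &$row) {
    // The leading "!" resets the time fields to 00:00:00 instead of "now".
    $date = DateTime::createFromFormat('!d/m/Y', $row['Date de Création']);
    $row['ts'] = $date ? $date->getTimestamp() : 0;
}
unset($row); // break the reference left behind by foreach

// The comparator no longer touches DateTime at all.
usort($rows, fn($x, $y) => $x['ts'] <=> $y['ts']);

foreach ($rows as $row) {
    echo $row['Date de Création'], "\n";
}
```

The DateTime parsing now happens once per row instead of twice per comparison.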

I've only tried this on a small file, but the principle is very similar to your idea: read the file, store the dates and sort them, then read the original file again and write out the sorted data.
In this version, the load just reads the dates and creates an array which holds the date and the position in the file of the start of the line (using ftell() after each read to get the file pointer).
It then sorts this array (as the date is the first element, a normal sort() is enough).
Then it goes through the sorted array and, for each entry, uses fseek() to locate the record in the file, reads the line (using fgets()) and writes this line to the output file:
$file = "a.csv";
$out = "sorted.csv";
$handle = fopen($file, "r");
$tabligne = [];
$start = 0;
while ( $data = fgetcsv($handle) ) {
    // Remember the date and the byte offset at which this line starts
    $tabligne[] = ['date' => DateTime::createFromFormat('d/m/Y', $data[2]),
                   'start' => $start ];
    $start = ftell($handle);
}
// Entries are compared element by element, so the date decides the order first
sort($tabligne);
$outHandle = fopen( $out, "w" );
foreach ( $tabligne as $entry ) {
    // Jump back to the start of the line and copy it verbatim
    fseek($handle, $entry['start']);
    $copy = fgets($handle);
    fwrite($outHandle, $copy);
}
fclose($outHandle);
fclose($handle);

I would load the data into a database and let that worry about the underlying algorithm.
If this is a one-time issue, I would suggest not automating it and using a spreadsheet instead.
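As a rough idea of what the database route could look like, the sketch below loads each raw line into SQLite through PDO together with a sortable key and reads the lines back in order. It assumes the pdo_sqlite extension is available, and the function, table and column names are invented for the example. It also assumes one record per physical line (no quoted fields containing line breaks).

```php
<?php
// Sketch: let SQLite do the sorting. Assumes the pdo_sqlite extension;
// the function, table and column names are illustrative only.
function sortCsvWithSqlite(string $in, string $out, int $dateCol, string $sep = ';'): void
{
    $db = new PDO('sqlite::memory:'); // use a file path instead for data larger than RAM
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE csv_lines (sort_key TEXT, line TEXT)');
    $insert = $db->prepare('INSERT INTO csv_lines (sort_key, line) VALUES (?, ?)');

    $fh = fopen($in, 'r');
    $db->beginTransaction(); // a single transaction keeps bulk inserts fast
    while (($line = fgets($fh)) !== false) {
        $line = rtrim($line, "\r\n");
        $row = str_getcsv($line, $sep, '"', '\\');
        // Rewrite d/m/Y as Y-m-d so that plain text ordering is chronological.
        $d = explode('/', trim($row[$dateCol] ?? ''));
        $key = count($d) === 3 ? sprintf('%04d-%02d-%02d', $d[2], $d[1], $d[0]) : '';
        $insert->execute([$key, $line]);
    }
    $db->commit();
    fclose($fh);

    $outFh = fopen($out, 'w');
    // rowid as a tie-breaker keeps the original order for equal dates.
    foreach ($db->query('SELECT line FROM csv_lines ORDER BY sort_key, rowid') as $r) {
        fwrite($outFh, $r['line'] . "\n");
    }
    fclose($outFh);
}
```

Lines without a parsable date get an empty key and are written first.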

How to avoid storing 1 million elements in an array in PHP

I'm parsing a 1 000 000 line CSV file in PHP to recover this data: IP address, DNS, cipher suites used.
In order to know whether some DNS (having several mail servers) has different cipher suites on its servers, I have to store in an array an object containing the DNS name, a list of the IP addresses of its servers, and a list of the cipher suites it uses. At the end I have an array of 1 000 000 elements. To know the number of DNS having different cipher suite configurations on their servers I do:
$res = 0;
foreach ($this->allDNS as $dnsObject) {
    if (count($dnsObject->getCiphers()) > 1) { // it has several different configs
        $res++;
    }
}
return $res;
Problem: this consumes too much memory; I can't run my code on a 1 000 000 line CSV (if I don't store the data in an array, I parse the CSV file in 20 seconds). Is there a way to get around this problem?
NB: I already added
ini_set('memory_limit', '-1');
but this line only bypasses the memory error.

Saving all of that CSV data will definitely take its toll on the memory.
One logical solution to your problem is to have a database that stores all of that data.
You may refer to this link for a tutorial on parsing your CSV file and storing it to database.

Write the processed data (for each line separately) to a file (or database):
file_put_contents('data.txt', $parsingresult, FILE_APPEND);
FILE_APPEND appends $parsingresult to the end of the file's content.
Then you can access the processed data with file_get_contents() or file().
Anyway, I think using a database and some pre-processing would be the best solution if this is needed more often.

You can use fgetcsv() to read and parse the CSV file one line at a time. Keep the data you need and discard the line:
// Store the useful data here
$data = array();
// Open the CSV file
$fh = fopen('data.csv', 'r');
// The first line probably contains the column names
$header = fgetcsv($fh);
// Read and parse one data line at a time
while ($row = fgetcsv($fh)) {
    // Get the desired columns from $row
    // Use $header if the order or number of columns is not known in advance
    // Store the gathered info into $data
}
// Close the CSV file
fclose($fh);
This way it uses the minimum amount of memory needed to parse the CSV file.
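Applied to the DNS question, the loop body could keep one small set of cipher suites per DNS name instead of one object per CSV row. The sketch below assumes the columns are ordered IP, DNS, cipher suite and that the first line is a header; both are guesses about the file, and the function name is invented.

```php
<?php
// Sketch: count DNS names seen with more than one distinct cipher suite,
// keeping one small set per DNS name rather than one object per CSV row.
// The column order (IP, DNS, cipher) and the function name are assumptions.
function countMixedCipherDns(string $path): int
{
    $ciphersByDns = [];               // DNS name => [cipher => true]
    $fh = fopen($path, 'r');
    fgetcsv($fh, 0, ',', '"', '\\');  // skip the header line
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        if (count($row) < 3) {
            continue;                 // blank or malformed line
        }
        [, $dns, $cipher] = $row;
        $ciphersByDns[$dns][$cipher] = true; // array keys act as a set
    }
    fclose($fh);

    $mixed = 0;
    foreach ($ciphersByDns as $ciphers) {
        if (count($ciphers) > 1) {
            $mixed++;
        }
    }
    return $mixed;
}
```

Memory use then grows with the number of distinct DNS names and cipher suites, not with the number of lines.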

Is it possible to load a single row from a CSV file?

Using PHP, is it possible to load just a single record / row from a CSV file?
In other words, I would like to treat the file as an array, but don't want to load the entire file into memory.
I know this is really what a database is for, but I am just looking for a down and dirty solution to use during development.
Edit: To clarify, I know exactly which row contains the info I am looking for.
I would just like to know if there is a way to get it without having to read the entire file into memory.

As I understand you are looking for a row with certain data. Therefore you could probably implement the following logic:
(1) scan file for the given data (ex. value which is in the row that you are trying to find),
(2) load only this line of file,
(3) perform your operations on that line.

fgetcsv() operates on a file resource handle, so if you can obtain the byte position of the line, you can fseek() the resource to that position and use fgetcsv() normally.

If you don't know which line you are looking for until after you have read the row, your best bet is to read records until you find the right one by testing the array that is returned.
$fp = fopen('data.csv', 'r');
while (false !== ($data = fgetcsv($fp, 0, ','))) {
    // fgetcsv() returns a numerically indexed array; 0 is the first column
    if ($data[0] === 'somevalue') {
        echo 'Hurray';
        break;
    }
}
fclose($fp);
If you are looking to read a specific line, use SplFileObject and seek to the line number (counted from zero). This returns a string that you must convert to an array:
$file = new SplFileObject('data.csv');
$file->seek(2);
$record = $file->current();
// str_getcsv() also handles quoted fields, unlike a plain explode()
$data = str_getcsv(rtrim($record, "\r\n"));
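As a variation on the snippet above, SplFileObject can also parse the line itself when the READ_CSV flag is set, so current() returns an array directly. The file contents below are made up. Note that seek() still reads through the file line by line up to the requested line; it just does not keep the earlier lines in memory.

```php
<?php
// Sketch: fetch one row by its zero-based line number with SplFileObject,
// letting the READ_CSV flag do the parsing. The file contents are invented.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "id,name\n1,\"Doe, Jane\"\n2,Bob\n");

$file = new SplFileObject($path);
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl(',', '"', '\\');
$file->seek(1);            // line 0 is the header, so this is the first data row
$row = $file->current();   // ['1', 'Doe, Jane']

print_r($row);

$file = null;              // release the file handle before deleting the file
unlink($path);
```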

Read CSV from end to beginning in PHP

I am using PHP to expose vehicle GPS data from a CSV file. This data is captured at least every 30 seconds for over 70 vehicles and includes 19 columns of data. This produces several thousand rows of data and file sizes around 614 KB. New data is appended to the end of the file. I need to pull out the last row of data for each vehicle, which should represent the most recent status. I am able to pull out one row for each unit; however, since the CSV file is in chronological order, I am pulling out the oldest data in the file instead of the newest. Is it possible to read the CSV from the end to the beginning? I have seen some solutions, but they typically involve loading the entire file into memory and then reversing it, which sounds very inefficient. Do I have any other options? Thank you for any advice you can offer.
EDIT: I am using this data to map real-time locations on-the-fly. The data is only provided to me in CSV format, so I think importing into a DB is out of the question.

With fseek() you can set the pointer to the end of the file and use a negative offset to read the file backwards.
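One possible way to do this is to read fixed-size blocks from the end and split them into lines, as in the sketch below. The function names, the block size and the assumption that the vehicle id is in the first column are all illustrative, and the approach assumes no quoted field contains a line break. Blank lines are skipped.

```php
<?php
// Sketch: yield the lines of a file from last to first by reading fixed-size
// blocks backwards with fseek(). Names and block size are assumptions.
function readLinesBackward(string $path, int $blockSize = 4096): Generator
{
    $fh = fopen($path, 'rb');
    try {
        fseek($fh, 0, SEEK_END);
        $pos = ftell($fh);
        $tail = '';                       // partial line carried over from the previous block
        while ($pos > 0) {
            $read = min($blockSize, $pos);
            $pos -= $read;
            fseek($fh, $pos);
            $lines = explode("\n", fread($fh, $read) . $tail);
            $tail = array_shift($lines);  // first piece may be incomplete, keep it for later
            foreach (array_reverse($lines) as $line) {
                $line = rtrim($line, "\r");
                if ($line !== '') {
                    yield $line;
                }
            }
        }
        if (rtrim($tail, "\r") !== '') {
            yield rtrim($tail, "\r");     // what remains is the first line of the file
        }
    } finally {
        fclose($fh);
    }
}

// Keep the first row seen from the end (the newest) for each key column value.
function latestRowPerKey(string $path, int $keyCol = 0, int $expectedKeys = 0): array
{
    $latest = [];
    foreach (readLinesBackward($path) as $line) {
        $row = str_getcsv($line, ',', '"', '\\');
        $key = $row[$keyCol] ?? null;
        if ($key !== null && !isset($latest[$key])) {
            $latest[$key] = $row;
        }
        if ($expectedKeys > 0 && count($latest) >= $expectedKeys) {
            break;                        // every known vehicle has been found
        }
    }
    return $latest;
}
```

If the number of vehicles is known, passing it as $expectedKeys lets the loop stop as soon as each one has been seen, so only the end of the file is read.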

If you must use CSV files instead of a database, then perhaps you could read the file line by line. This prevents more than the current line from being held in memory (thanks to the garbage collector).
$handle = @fopen("/path/to/yourfile.csv", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        // old values of $last are garbage collected after re-assignment
        $last = $line;
        // you can perform optional computations on past data here if desired
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);
    // $last will now contain the last line of the file.
    // You may now do whatever with it
}
Edit: I did not see the fseek() post. If all you need is the last line, then that is the way to go.

PHP file random access and saving objects to a file

I have a CSV file with records sorted on the first field. I managed to write a function that does a binary search through that file, using fseek() for random access.
However, this is still a pretty slow process, since when I seek to some file position, I actually need to look left for a \n character, so I can make sure I'm reading a whole line (once the whole line is read, I can check the first field value mentioned above).
Here is the function that returns the line that contains the character at position x:
function fgetLineContaining( $fh, $x ) {
    if( $x < 0 || $x > 125145411) // 12514511 is the last pos in my file
        return "";
    // now go as much left as possible, until newline is found
    // or beginning of the file
    $c = ""; // last character read
    while( $x > 0 && $c != "\n" && $c != "\r") {
        fseek($fh, $x);
        $x--; // go left in the file
        $c = fgetc( $fh );
    }
    $x+=2; // skip newline char
    fseek( $fh, $x );
    return fgets( $fh, 1024 ); // return the line from the beginning until \n
}
While this is working as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks are slowing things down quite a lot.
Is there a better way to find the line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, thus enabling reading a file object by object. Does PHP support that?
Thanks

I think you really should consider using SQLite or MySQL again (like others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough RAM to keep its indexes in memory?
One thing you can try to speed up the current algorithm is to load the (~100MB ?) file onto a RAM disk. No matter what you choose, either CSV or SQLite, this will help speed things up, especially if the hard drive seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups.
Also keep in mind that PHP isn't exactly fast at low-level things. You might want to consider moving away from PHP in favor of C or C++.
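If you stay with the CSV file, a lighter variant of the pre-calculated index idea is to store only the byte offset of each line in an array and binary-search on line numbers, so that every probe seeks directly to the start of a line. The sketch below is one way this might look; the function names are invented, and it assumes the first field is sorted in strcmp() (byte-wise) order with one record per physical line.

```php
<?php
// Sketch: record the byte offset of every line once, then binary-search the
// sorted first field by seeking straight to line starts. Names are assumptions.
function buildLineIndex(string $path): array
{
    $offsets = [];
    $fh = fopen($path, 'rb');
    $pos = 0;
    while (fgets($fh) !== false) {
        $offsets[] = $pos;      // where the line just read started
        $pos = ftell($fh);      // where the next line will start
    }
    fclose($fh);
    return $offsets;
}

/** Return the row whose first field equals $key, or null if there is none. */
function findRow($fh, array $offsets, string $key): ?array
{
    $lo = 0;
    $hi = count($offsets) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        fseek($fh, $offsets[$mid]);
        $row = fgetcsv($fh, 0, ',', '"', '\\');
        $cmp = strcmp($row[0] ?? '', $key);
        if ($cmp === 0) {
            return $row;
        }
        if ($cmp < 0) {
            $lo = $mid + 1;     // key is further down the file
        } else {
            $hi = $mid - 1;     // key is further up the file
        }
    }
    return null;
}
```

Building the index costs one sequential pass over the file; after that, each lookup needs about log2(n) seeks and no backward scanning for a newline.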
