I read a big text file (~500 MB) and want to show progress during my read operations.
To do so, I currently count the lines the file has and then compare that to the number I have already read. This requires two complete passes over the file. Is there an easier way using the file size and the fgets buffer size?
My current code looks like:
$lineTotal = 0;
while (fgets($handle) !== false) {
    $lineTotal++;
}
rewind($handle);

$linesDone = 0;
while (($line = fgets($handle)) !== false) {
    progressBar(++$linesDone, $lineTotal);
}
This works in bytes rather than lines, but you can quickly get the total size of the file upfront with filesize():
$bytesTotal = filesize("input.txt");
Then, after you've opened the file, you can read each line and then get your current position within the file, something like:
progressBar(0, $bytesTotal);
while (($line = fgets($handle)) !== false) {
    doSomethingWith($line, 'presumably');
    progressBar(ftell($handle), $bytesTotal);
}
There are caveats around PHP integers not handling files over 2 GB but, since you specified your files are about 500 MB, that shouldn't be an immediate problem.
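If you ever need to guard against that limit, a quick sanity check (just a sketch) looks like this:
// filesize() and ftell() can overflow on 32-bit PHP builds, where
// PHP_INT_MAX is 2147483647 (about 2 GB); 64-bit builds are unaffected.
if (PHP_INT_MAX === 2147483647) {
    trigger_error('32-bit PHP: byte progress may be wrong past 2 GB', E_USER_NOTICE);
}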
I need help processing files holding about 46k lines or more than 30MB of data.
My original idea was to open the file and turn each line into an array element. This worked the first time as the array held about 32k values total.
The second time the process was repeated, the array only held 1011 elements, and the third time it could only hold 100.
I'm confused and don't know much about the backend array processes. Can someone explain what is happening and fix the code?
function file_to_array($cvsFile){
    $handle = fopen($cvsFile, "r");
    $path = fread($handle, filesize($cvsFile));
    fclose($handle);
    //Turn the file into an array and separate lines to elements
    $csv = explode(",", $path);
    //Remove common double spaces
    foreach ($csv as $key => $line){
        $csv[$key] = str_replace(' ', '', str_getcsv($line));
    }
    array_filter($csv);
    //get the row count for the file and array
    $rows = count($csv);
    $filerows = count(file($cvsFile)); //this no longer works
    echo "File has $filerows and array has $rows";
    return $csv;
}
The approach here can be split into two parts:
1. Optimized file reading and processing
2. Proper storage solution
Optimized file processing can be done like so:
$handle = fopen($cvsFile, "r");
$rowsSucceed = 0;
$rowsFailed = 0;
$csv = array();
if ($handle) {
    while (($line = fgets($handle)) !== false) { // Reading file by line
        // Parse the CSV line, check whether it parsed correctly,
        // and count as you go
        $parsedLine = str_getcsv($line);
        if (!empty($parsedLine) && $parsedLine !== array(null)) {
            $csv[] = $parsedLine;
            $rowsSucceed++;
        } else {
            $rowsFailed++;
        }
    }
    fclose($handle);
} else {
    // Error handling
}
$totalLines = $rowsSucceed + $rowsFailed;
Also, you can avoid array_filter() entirely by simply not adding a processed line when it is empty.
This keeps memory usage down during script execution.
Proper storage
Proper storage is needed here for performing operations on this amount of data. Repeated file reads are inefficient and expensive. Using a simple file-based database like SQLite can help a lot and increase the overall performance of your script.
For this purpose you should probably process your CSV directly into the database and then perform the count operation on the parsed data, avoiding excessive file line counts and the like.
It also gives you the further advantage of working with the data without keeping it all in memory.
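A minimal sketch of that idea using PDO's SQLite driver (the table and column names here are made up for illustration):
$db = new PDO('sqlite:' . __DIR__ . '/csv_data.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS rows (a TEXT, b TEXT, c TEXT)');
$insert = $db->prepare('INSERT INTO rows (a, b, c) VALUES (?, ?, ?)');

$handle = fopen($cvsFile, 'r');
$db->beginTransaction(); // batch all inserts into one transaction for speed
while (($line = fgetcsv($handle)) !== false) {
    if ($line === array(null)) {
        continue; // skip blank lines instead of calling array_filter()
    }
    // pad/trim to exactly three columns for the prepared statement
    $insert->execute(array_slice(array_pad($line, 3, ''), 0, 3));
}
$db->commit();
fclose($handle);

// counting is now a query, not another pass over the file
$rows = $db->query('SELECT COUNT(*) FROM rows')->fetchColumn();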
Your question says you want to "turn each line into an array element" but that is definitely not what you are doing. The code is quite clear; it reads the entire file into $path and then uses explode() to make one massive flat array of every element on every line. Then later you're trying to run str_getcsv() on each item, which of course isn't going to work; you've already exploded all the commas away.
Looping over the file using fgetcsv() makes more sense:
function file_to_array($cvsFile) {
    $csv = array();
    $filerows = 0;
    $handle = fopen($cvsFile, "r");
    while ($line = fgetcsv($handle)) {
        $filerows++;
        // skip empty lines
        if ($line[0] === null) {
            continue;
        }
        // Remove common double spaces
        $csv[] = str_replace(' ', '', $line);
    }
    // get the row count for the file and array
    $rows = count($csv);
    echo "File has $filerows and array has $rows";
    fclose($handle);
    return $csv;
}
I am developing an application which has to read a large CSV file and process the data. It will definitely not be possible to do it in one request, because processing the data also takes time; it is not just about reading.
So what I have tried so far, and what has been working well, is the following:
// Open file
$handle = fopen($file, 'r');
// Move pointer to the place where it stopped last time
fseek($handle, $offset);
// Read a limited number of lines and process them
for ($i = 0; $i < $limit; $i++) {
    // Get length of line for offset purposes
    $newlength = strlen(fgets($handle));
    // Move pointer back; fgets advances the pointer, so we rewind for fgetcsv to read that line again
    fseek($handle, $offset);
    $line = fgetcsv($handle, 0, $csv_delimiter);
    // Process data here
    // Save offset
    $offset += $newlength;
}
So the problem is here on this line:
$newlength = strlen(fgets($handle));
It fails when a CSV column has line breaks.
I also tried $newlength = strlen(implode(';', fgetcsv($handle, 0, $csv_delimiter))); but this does not always work. It is usually off by a few characters; quoting and end-of-line characters are probably not handled properly there.
All I need is the length of a CSV line: not just a single physical line, but a CSV record which might have line breaks within quotes.
Does anybody have a better solution?
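For reference, one way to measure a full CSV record, quoted line breaks included, is to let fgetcsv() do the parsing and compare ftell() positions before and after the call. A sketch along the lines of the code above:
$handle = fopen($file, 'r');
fseek($handle, $offset);

for ($i = 0; $i < $limit; $i++) {
    $start = ftell($handle);
    $line = fgetcsv($handle, 0, $csv_delimiter);
    if ($line === false) {
        break; // end of file
    }
    // fgetcsv() consumed the entire logical record, including any line
    // breaks inside quoted fields, so the pointer now sits at the start
    // of the next record.
    $offset = ftell($handle);
    $recordLength = $offset - $start;
    // process $line here; save $offset for the next run
}
fclose($handle);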
Do one thing: create a temporary MySQL table named "my_csv_data", give it a field for every column in the CSV file, plus one extra "is_processed" field as enum(0,1) with default value '0'.
Now import all your CSV data into that SQL table. A single insert never takes much time.
Now create one function/file which fetches 10 or 100 records from my_csv_data where is_processed='0', processes them, and on success updates the "is_processed" field to '1'.
Now create a cron job which hits that file/function periodically.
This way the data will be silently inserted into your table without disturbing any admin or front-end user.
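A rough sketch of that flow (the column names and connection details are hypothetical):
// One-time import: load every CSV row into the staging table, unprocessed.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // hypothetical credentials
$insert = $pdo->prepare("INSERT INTO my_csv_data (col_a, col_b, is_processed) VALUES (?, ?, '0')");
$handle = fopen('input.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    $insert->execute(array($row[0], $row[1]));
}
fclose($handle);

// Cron-driven worker: grab a small batch, process it, mark it done.
$batch = $pdo->query("SELECT id, col_a, col_b FROM my_csv_data WHERE is_processed = '0' LIMIT 100");
$done = $pdo->prepare("UPDATE my_csv_data SET is_processed = '1' WHERE id = ?");
foreach ($batch as $record) {
    // ... process $record ...
    $done->execute(array($record['id']));
}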
I have CodeIgniter code where I upload the CSV file data and insert it into a MySQL database. Hope this will help you:
if ($_FILES["file"]["size"] > 0)
{
    // Load the model once, outside the loop
    $this->load->model('currency_model');

    $file = fopen($filename, "r");
    while (($emapData = fgetcsv($file, 10000, ",")) !== FALSE)
    {
        $data = array(
            'reedumption_code' => $emapData[0],
            'jb_note_id'       => $jbmoney_id,
            'jbmoney'          => $jbamount,
            'add_date'         => time(),
            'modify_date'      => time(),
            'user_id'          => 0,
            'status'           => 1,
            'assign_date'      => 0,
            'del_status'       => 1,
            'store_status'     => 1
        );
        $insertId = $this->currency_model->insertCSV($data);
    }
    fclose($file);
    redirect('currency/add_currency?msg=Data Imported Successfully');
}
I have an Excel (file.xls) / CSV (file.csv) file that contains, or will contain, hundreds of thousands of entries, maybe even millions. Is it possible to split it into multiple files, like file.xls into file1.xls, file2.xls, file3.xls and so on?
Are there any libraries to use? Is this possible in PHP? Or how about JavaScript?
Can I specify how many rows to include in each file?
Thanks
Quick and dirty way of splitting a CSV file into several CSV files
$inputFile = 'input.csv';
$outputFile = 'output';
$splitSize = 10000;

$in = fopen($inputFile, 'r');

$rowCount = 0;
$fileCount = 1;
while (!feof($in)) {
    if (($rowCount % $splitSize) == 0) {
        if ($rowCount > 0) {
            fclose($out);
        }
        $out = fopen($outputFile . $fileCount++ . '.csv', 'w');
    }
    $data = fgetcsv($in);
    if ($data) {
        fputcsv($out, $data);
    }
    $rowCount++;
}
fclose($out);
fclose($in);
Yes, it is possible to do that in PHP with CSV files. You basically iterate over the large file, chunk it every X rows, and forward those rows to another file.
You can find information on how to open the large CSV file as an iterator in this answer:
Answer to "how to extract data from csv file in php"
Then you need to chunk the iterator into parts of X rows each. That can be done as outlined here:
Answer to "Need some advice with PHP loop"
Just instead of outputting into multiple <ul>...</ul> HTML lists, you copy the rows over into new files. That basically works as outlined in:
Answer to "How can I split a CSV file in PHP?"
However, this time you want to use the SplFileObject::fputcsv method. Take care to use the latest stable PHP for this, otherwise you need to do it differently; see fputcsv().
If the first line of the original file contains column headers, you might also be interested in the following:
Answer to "Process CSV Into Array With Column Headings For Key"
It just shows some ways to extend / process the incoming file. You might not need the full abstraction done there; just keeping the first line around might do it already.
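Pulling those pieces together, a rough sketch (the chunk size and file names are placeholders):
$in = new SplFileObject('large.csv', 'r');
$in->setFlags(SplFileObject::READ_CSV);

$chunkSize = 1000; // rows per output file
$written = 0;
$fileCount = 0;
$out = null;

foreach ($in as $row) {
    if ($row === array(null)) {
        continue; // READ_CSV yields array(null) for blank lines
    }
    if ($written % $chunkSize === 0) {
        $fileCount++;
        $out = new SplFileObject('chunk' . $fileCount . '.csv', 'w');
        // if the first row is a header, you could fputcsv() it into each
        // new chunk here, as discussed above
    }
    $out->fputcsv($row); // SplFileObject::fputcsv() needs PHP 5.4+
    $written++;
}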
I think you can also split by file size:
$part = 1;
$maxSize = 50; // 50 MB
$fopen = fopen('filename.csv', 'r') or die('ERROR');
$ftowrite = fopen("Part_$part.csv", 'a');
while (($line = fgetcsv($fopen, 10000, ";")) !== FALSE) {
    fputcsv($ftowrite, $line);
    clearstatcache();
    $size = filesize("Part_$part.csv") / 1000000; // size in MB
    if ($size > $maxSize) {
        fclose($ftowrite);
        $part++;
        $ftowrite = fopen("Part_$part.csv", 'a');
    }
}
fclose($ftowrite);
fclose($fopen);
I am using the following code to read a CSV file and add it to an array:
echo "starting CSV import<br>";
$current_row = 1;
$handle = fopen($csv, "r");
while ( ($data = fgetcsv($handle, 10000, ",") ) !== FALSE )
{
$number_of_fields = count($data);
if ($current_row == 1) {
//Header line
for ($c=0; $c < $number_of_fields; $c++)
{
$header_array[$c] = $data[$c];
}
} else {
//Data line
for ($c=0; $c < $number_of_fields; $c++)
{
$data_array[$header_array[$c]] = $data[$c];
}
array_push($products, $data_array);
}
$current_row++;
}
fclose($handle);
echo "finished CSV import <br>";
However, when using a very large CSV, this times out on the server or hits a memory limit error.
I'd like a way to do it in stages, so after the first say 100 lines it will refresh the page, starting at line 101.
I will probably be doing this with a meta refresh and a URL parameter.
I just need to know how to adapt that code above to start at the line I tell it to.
I have looked into fseek() but I'm not sure how to implement this here.
Can you please help?
The timeout can be circumvented using:
ignore_user_abort(true);
set_time_limit(0);
When experiencing problems with the memory limit, it may be wise to take a step back and look at what you're actually doing with the data you're processing. Are you pushing the data into a database? calculate something off the data but don't need to store the actual data, …
Do you really need to push (array_push($products, $data_array);) the rows into an array (for later processing)? Can you instead write to the database directly? Or calculate directly? Or build an HTML <table> directly? Or whatever the hell you're doing right then and there, within the while() loop, without pushing everything into an array first?
If you're able to chunk the processing, I guess you don't need that array at all. Otherwise you'd have to restore the array for every chunk - not solving the memory issue one bit.
If you can manage to change your processing algorithm to waste less memory / time, you should seriously consider that over any chunked processing requiring a round-trip to the browser (for so many performance and security reasons…).
Anyways, you can, at any time, identify the current stream offset with ftell() and re-set to that position using fseek(). You'd only need to pass that integer to your next iteration.
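A minimal sketch of that hand-off, assuming the offset comes back in as a (hypothetical) offset URL parameter:
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0; // hypothetical parameter name
$limit = 100; // rows per page load

$handle = fopen($csv, "r");
fseek($handle, $offset); // jump to where the previous run stopped

for ($i = 0; $i < $limit; $i++) {
    $data = fgetcsv($handle, 10000, ",");
    if ($data === false) {
        break; // reached the end of the file: done
    }
    // ... process $data ...
}

$nextOffset = ftell($handle); // pass this as the offset for the next refresh
fclose($handle);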
Also there is no need for your inner for() loops. This should produce the same results:
<?php
$products = array();
$cols = null;
$first = true;

$handle = fopen($csv, "r");
while (($data = fgetcsv($handle, 10000, ",")) !== false) {
    if ($first) {
        $cols = $data;
        $first = false;
    } else {
        $products[] = array_combine($cols, $data);
    }
}
fclose($handle);
echo "finished CSV import <br>";
Trying to use fgetcsv() to parse a CSV file and do stuff with it, using the following code found all over the Internet, including the PHP function definition page:
if (($handle = fopen("test.csv", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
print_r($data);
}
fclose($handle);
}
But the code gives me an infinite loop of warnings on the $data = line:
PHP Warning: fgetcsv() expects parameter 1 to be resource, boolean given in...
I know the file I'm opening is a valid file, because if I add a dummy character to the file name I get a different error and no loop.
The file is in a folder with full permissions.
I'm not using a CSV generated by an Excel on Mac (there's a quirky error there)
PHP version 5.1.6, so there should be no problem with the function
I know the file's not too big, or malformed, because I kept shrinking the original file to see if that was a problem and finally just created a custom file in Notepad with nothing more than two lines like:
Value1A,Value1B,Value1C,Value1D
Still looping and giving no data. Here's the full code I'm working with now (with a row-count guard larger than the number of lines, so I can show that it would loop infinitely without actually giving my server an infinite loop):
if ($handle = fopen($_SERVER['DOCUMENT_ROOT'].'/tmp/test-csv-file.csv', 'r') !== FALSE) {
    while ((($data = fgetcsv($handle, 1000, ',')) !== FALSE) && ($row < 10)) {
        print_r($data);
        $row++;
    }
    fclose($handle);
}
So I really have two questions.
1) What could I possibly be overlooking that is causing this loop? I'm half-convinced it's something really "face-palm" simple...
2) Why is the recommended code for this function something that can cause an infinite loop if the file exists but there is some unknown problem? I would have thought the purpose of the !== FALSE and so forth would be to prevent that kind of stuff.
There's no question about what's going on here: the file is not opened successfully. That's why $handle is a bool instead of a resource (var_dump($handle) to confirm this yourself).
fgetcsv then returns null (not false!) because there's an error, and your test doesn't pick this up because you are testing with !== false. As the documentation states:
fgetcsv() returns NULL if an invalid handle is supplied or FALSE on other errors, including end of file.
I agree that returning null and false for different error conditions is not ideal, and furthermore that it's against the precedent established by lots of other functions, but that's just how it is (and things could be worse). As things stand, you can simply change the test to
while ($data = fgetcsv($handle, 1000, ","))
and it will work correctly in both cases.
Update:
You are the victim of assignment inside an if condition:
if ($handle = fopen($_SERVER['DOCUMENT_ROOT'].'/tmp/test-csv-file.csv', 'r') !== FALSE)
should have been
// wrap the assignment to $handle inside parens!
if (($handle = fopen($_SERVER['DOCUMENT_ROOT'].'/tmp/test-csv-file.csv', 'r')) !== FALSE)
I'm sure you understand what went wrong here. This is the reason why I choose to never, ever, make assignments inside conditionals. I don't care that it's possible. I don't care that it's shorter. I don't even care that sometimes it's quite less "elegant" to write the loop if the assignment is taken out. If you value your sanity, consider doing the same.
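For what it's worth, the same flow with the assignments pulled out of the conditionals entirely would look like this (a style sketch, not a functional change):
$handle = fopen($_SERVER['DOCUMENT_ROOT'] . '/tmp/test-csv-file.csv', 'r');
if ($handle !== false) {
    $data = fgetcsv($handle, 1000, ',');
    while ($data !== false && $data !== null) {
        print_r($data);
        $data = fgetcsv($handle, 1000, ',');
    }
    fclose($handle);
}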
Try the given code snippet; as I have noticed, you are missing some important things in your code.
$row = 1;
if (($handle = fopen($_FILES['csv-file']['tmp_name'], "r")) !== FALSE) {
    // The first read here pulls off the initial row (e.g. a header line)
    $data = fgetcsv($handle, 1000, ",");
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $num = count($data);
        echo "<p> $num fields in line $row: <br /></p>\n";
        $row++;
        for ($c = 0; $c < $num; $c++) {
            echo $data[$c] . "<br />\n";
        }
    }
    fclose($handle);
}