I have a large xlsx file, approximately 24 MB in size. It takes too much time even when I only have to read the first row. If Spout reads each row one by one, why does it take so long even if I only need to read the first row?
The following is the complete code:
require_once 'src/Spout/Autoloader/autoload.php';

use Box\Spout\Reader\ReaderFactory;
use Box\Spout\Common\Type;

// Location of the Excel file
$file_path = $_SERVER["DOCUMENT_ROOT"].'spout'.'/'.'testdata.xlsx';

libxml_disable_entity_loader(false);

try {
    $reader = ReaderFactory::create(Type::XLSX); // set file type to XLSX
    $reader->open($file_path); // open the file
    $i = 0;

    /**
     * Sheets iterator, just in case there are multiple sheets
     **/
    foreach ($reader->getSheetIterator() as $sheet) {
        // Rows iterator
        foreach ($sheet->getRowIterator() as $row) {
            echo $i."<hr>";
            if ($i == 0) { // if first row
                print_r($row);
                exit; // exit after reading the first row
            }
            ++$i;
        }
        exit;
    }

    echo "Total Rows : " . $i;
    $reader->close();
    echo "Peak memory:", (memory_get_peak_usage(true) / 1024 / 1024), " MB";
} catch (Exception $e) {
    echo $e->getMessage();
    exit;
}
Can you help me understand the reason for this issue? How can I make it faster?
You can find the test xlsx file at http://www.mediafire.com/file/y369njsaeeah1ip/testdata.xlsx
The Excel file contains the following content:
Number of rows: 999991
Number of columns: 4 (i.e. MPN, CATEGORY, MFG, Description)
The file size is approximately 24 MB and it does not contain any formatting.
There are 2 ways to store cell data in XLSX files:
The simplest one is to keep the cell values with the cell structure (i.e. cell "A1" contains "foo", "B1" contains "bar").
The other way is to keep track of the different values used in the spreadsheet and add a layer of indirection that helps remove duplicates: this translates to 2 files, one describing the structure (i.e. cell "A1" contains the value referenced by ID1, "B1" => ID2, "C1" => ID1) and one describing the values (ID1 => "foo", ID2 => "bar").
Method 2 optimizes the size of the file, since a value used N times is stored only once (but referenced N times). However, to read these values you now need to read 2 files and have the mapping ready when you read the structure. Basically, in order to read the first row, you read the structure to get the cells (A1, B1, C1) and then you need to resolve the values using the IDs.
The inline method is more straightforward since everything is stored in the same place, so you can read the structure and the values at the same time. No need for a mapping table.
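To make the indirection concrete, here is a rough, illustrative sketch (not Spout's actual code, and not memory-efficient) that reads both parts of a shared-strings XLSX by hand. The internal paths xl/sharedStrings.xml and xl/worksheets/sheet1.xml are the standard XLSX layout; the archive name is the one from the question.

$ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

$zip = new ZipArchive();
$zip->open('testdata.xlsx');

// Step 1: build the ID => value mapping from the shared strings part.
// With millions of cell values, this step alone takes a while.
$sharedStrings = [];
$sst = new DOMDocument();
$sst->loadXML($zip->getFromName('xl/sharedStrings.xml'));
foreach ($sst->getElementsByTagNameNS($ns, 'si') as $si) {
    $sharedStrings[] = $si->textContent;
}

// Step 2: read the structure and resolve each cell through the mapping.
$sheet = new DOMDocument();
$sheet->loadXML($zip->getFromName('xl/worksheets/sheet1.xml'));
$firstRow = $sheet->getElementsByTagNameNS($ns, 'row')->item(0);
foreach ($firstRow->getElementsByTagNameNS($ns, 'c') as $cell) {
    $v = $cell->getElementsByTagNameNS($ns, 'v')->item(0);
    if ($v === null) {
        continue; // empty cell, nothing to resolve
    }
    // t="s" marks a shared-string cell: the stored value is an ID, not the text.
    $value = ($cell->getAttribute('t') === 's') ? $sharedStrings[(int) $v->textContent] : $v->textContent;
    echo $value, "\n";
}
$zip->close();

Even though only the first row of the structure is touched, the whole shared strings part still has to be processed first, which is exactly the cost described below.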
Now back to your problem! The file you're trying to read is most likely using method 2 (a file describing the spreadsheet's structure + a file containing all the values). When Spout initializes the reader, it processes the file containing the values so that the mapping is ready whenever we start reading rows.
This processing can take a long time if there are a lot of values. Below a certain threshold (which depends on the amount of memory available), Spout loads the mapping [ID => value] into memory, which is pretty fast. However, if there are too many values, Spout decides that everything may not fit into memory and caches chunks of the mapping on disk. This process is definitely time consuming...
So this is what's happening in your case. Hopefully that makes more sense now.
Eventually the threshold will be moved higher, as Spout is currently ultra defensive to avoid out-of-memory issues.
Related
I'm looking for an algorithm strategy. I have a CSV file with 162 columns and 55,000 lines.
I want to order the data by a date (which is in column 3).
First I tried to put everything directly into an array, but memory explodes.
So I decided to:
1/ Put the first 3 columns into an array.
2/ Order this array with usort.
3/ Read the CSV file to recover the other columns.
4/ Add the complete line to a new CSV file.
5/ Replace the line with an empty string in the CSV file that was read.
// First read of the file
$handle = fopen(PATH_CSV.'original_file.csv', 'r');
$tableauDossiers = [];
$row = 0;
while (($data = fgetcsv($handle, 0, ';')) !== false)
{
    // Keep only the first 3 columns, keyed by their column names
    $tabLigne[$columnNames[0]] = $data[0];
    $tabLigne[$columnNames[1]] = $data[1];
    $tabLigne[$columnNames[2]] = $data[2];

    $dateCreation = DateTime::createFromFormat('d/m/Y', $tabLigne['Date de Création']);
    if ($dateCreation !== false)
    {
        $tableauDossiers[$row] = $tabLigne;
    }
    $row++;
    unset($data);
    unset($tabLigne);
}
// Order the array by date
usort(
    $tableauDossiers,
    function ($x, $y) {
        $date1 = DateTime::createFromFormat('d/m/Y', $x['Date de Création']);
        $date2 = DateTime::createFromFormat('d/m/Y', $y['Date de Création']);
        return $date1->format('U') <=> $date2->format('U');
    }
);
fclose($handle);
copy(PATH_CSV.'original_file.csv', PATH_CSV.'copy_of_file.csv');

for ($row = 3; $row <= count($tableauDossiers); $row++)
{
    // Re-open the working copy for every sorted entry
    $handle = fopen(PATH_CSV.'copy_of_file.csv', 'c+');
    $tabHandle = file(PATH_CSV.'copy_of_file.csv');

    // Skip the first two lines
    fgetcsv($handle);
    fgetcsv($handle);
    $rowHandle = 2;

    while (($data = fgetcsv($handle, 0, ';')) !== false)
    {
        if ($tableauDossiers[$row]['Caisse Locale Déléguée'] == $data[0]
            && $tableauDossiers[$row]['Date de Création'] == $data[1]
            && $tableauDossiers[$row]['Numéro RCT'] == $data[2])
        {
            // Write the complete line to the output file ($fichierSortieDossier
            // is the output handle, assumed to be opened earlier),
            // then blank the matched line in the working copy
            fputcsv($fichierSortieDossier, $data, ';');
            $tabHandle[$rowHandle] = str_replace("\n", '', $tabHandle[$rowHandle]);
            file_put_contents(PATH_CSV.'copy_of_file.csv', $tabHandle);
            unset($tabHandle);
            break;
        }
        $rowHandle++;
        unset($data);
        unset($tabLigne);
    }
    fclose($handle);
    unset($handle);
}
This algorithm really takes too long to execute, but it works.
Any idea how to improve it?
Thanks
Assuming you are limited to using PHP and cannot use a database as suggested in the comments, the next best option is to use an external sorting algorithm.
Split the file into small files. The files should be small enough to sort them in memory.
Sort all these files individually in memory.
Merge the sorted files to one big file by comparing the first lines of each file.
The merging of the sorted files can be done in a very memory-efficient way: you only need to have the first line of each file in memory at any given time. The first line with the minimal timestamp should go to the resulting file.
For really big files you can cascade the merging, i.e. if you have 10,000 files, you can merge groups of 100 files first and then merge the resulting 100 files.
Example
I use a comma to separate values instead of line breaks for readability.
The unsorted file (imagine it to be too big to fit into memory):
1, 6, 2, 4, 5, 3
Split the file into parts that are small enough to fit into memory:
1, 6, 2
4, 5, 3
Sort them individually:
1, 2, 6
3, 4, 5
Now merge:
Compare 1 & 3 → take 1
Compare 2 & 3 → take 2
Compare 6 & 3 → take 3
Compare 6 & 4 → take 4
Compare 6 & 5 → take 5
Take 6.
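Translating the merge step into PHP, here is a hedged sketch rather than a drop-in solution. It assumes the sorted chunks have already been written to files named chunk_0.csv, chunk_1.csv, ... (names chosen for the example), that they use ';' as the delimiter like the original file, and that the d/m/Y date sits in the third column; a priority queue keeps only one pending line per chunk in memory.

$chunks = glob('chunk_*.csv');
$out    = fopen('sorted.csv', 'w');

$handles = [];
$queue   = new SplPriorityQueue(); // extracts the highest priority first

// Seed the queue with the first line of every chunk.
foreach ($chunks as $i => $path) {
    $handles[$i] = fopen($path, 'r');
    if (($row = fgetcsv($handles[$i], 0, ';')) !== false) {
        $ts = DateTime::createFromFormat('d/m/Y', $row[2])->getTimestamp();
        // Negate the timestamp so the oldest date comes out first.
        $queue->insert([$i, $row], -$ts);
    }
}

// Repeatedly take the oldest pending line and refill from its chunk.
while (!$queue->isEmpty()) {
    [$i, $row] = $queue->extract();
    fputcsv($out, $row, ';');
    if (($next = fgetcsv($handles[$i], 0, ';')) !== false) {
        $ts = DateTime::createFromFormat('d/m/Y', $next[2])->getTimestamp();
        $queue->insert([$i, $next], -$ts);
    }
}

foreach ($handles as $h) {
    fclose($h);
}
fclose($out);

At any point only one buffered line per chunk (plus the heap) is in memory, so the approach scales with the number of chunks rather than the number of lines.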
You have a fairly large set of data to process, so you need to do something to optimize it.
You could increase your memory limit, but that will only postpone the error; when there is a bigger file, it will crash again (or get way too slow).
The first option is to try to minimize the amount of data. Remove all non-relevant columns from the file. Whichever solution you apply, a smaller dataset is always faster.
I suggest you put it into a database and apply your requirements to it, using that result to create a new file. A database is made to manage large data sets, so it'll take a whole lot less time.
Taking that much data and writing it to a file from PHP will still be slow, but could be manageable. Another tactic might be to use the command line via a .sh file. If you have basic terminal/SSH skills, you have basic .sh writing capabilities. In that file you can use mysqldump to export to CSV. mysqldump will be significantly faster, but it's a bit trickier to get going when you're used to PHP.
To improve your current code:
- The unset calls at the end of the first loop don't do anything useful. The variables barely store data and get reset anyway when the next iteration of the while starts.
- Instead of using DateTime for everything, which is easier to work with but slower, use epoch values. I don't know what format the dates come in now, but if you use epoch seconds (like the result of time()), you are comparing two plain numbers. Your usort() will improve drastically, as it no longer has to use the heavy DateTime class but just a simple number comparison (see the sketch below).
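As a sketch of that last point (re-using the variable names from your code, so treat it as illustrative; the date column index follows the comparison in your second loop, adjust it if needed): convert each date to a Unix timestamp once while reading, then sort on plain integers.

// Inside the read loop, instead of keeping the raw 'Date de Création' string:
$dt = DateTime::createFromFormat('d/m/Y', $data[1]);
if ($dt !== false) {
    $tableauDossiers[] = [
        'timestamp' => $dt->getTimestamp(), // epoch seconds, computed once
        'line'      => $data,
    ];
}

// After reading the whole file, the comparison is a plain integer one:
usort($tableauDossiers, function ($x, $y) {
    return $x['timestamp'] <=> $y['timestamp'];
});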
This all assumes that you need to do it multiple times. If not, just open it in Excel or Numbers, use their sorting and save a copy.
I've only tried this on a small file, but the principle is very similar to your idea of reading the file, storing the dates and then sorting them, then reading the original file and writing out the sorted data.
In this version, the load just reads the dates in and creates an array which holds the date and the position in the file of the start of the line (using ftell() after each read to get the file pointer).
It then sorts this array (as the date is first, it just uses a normal sort()).
Then it goes through the sorted array and, for each entry, uses fseek() to locate the record in the file, reads the line (using fgets()) and writes this line to the output file...
$file = "a.csv";
$out  = "sorted.csv";

$handle = fopen($file, "r");

// Pass 1: store the date and the byte offset of the start of each line
$tabligne = [];
$start = 0;
while ( $data = fgetcsv($handle) ) {
    $tabligne[] = ['date'  => DateTime::createFromFormat('d/m/Y', $data[2]),
                   'start' => $start];
    $start = ftell($handle);
}

// As 'date' comes first in each entry, a normal sort orders by date
sort($tabligne);

// Pass 2: seek to each line in date order and copy it to the output file
$outHandle = fopen( $out, "w" );
foreach ( $tabligne as $entry ) {
    fseek($handle, $entry['start']);
    $copy = fgets($handle);
    fwrite($outHandle, $copy);
}
fclose($outHandle);
fclose($handle);
I would load the data into a database and let that worry about the underlying algorithm.
If this is a one-time issue, I would suggest not automating it and using a spreadsheet instead.
I'm parsing a 1,000,000-line CSV file in PHP to recover this data: IP address, DNS, and the cipher suites used.
In order to know whether some DNS names (having several mail servers) have different cipher suites in use on their servers, I have to store in an array an object containing the DNS name, a list of the IP addresses of its servers, and a list of the cipher suites it uses. At the end I have an array of 1,000,000 elements. To count the DNS names having different cipher suite configurations on their servers I do:
$res = 0;
foreach ($this->allDNS as $dnsObject) {
    if (count($dnsObject->getCiphers()) > 1) { // if it has several different configs
        $res++;
    }
}
return $res;
Problem: this consumes too much memory; I can't run my code on the 1,000,000-line CSV (if I don't store these data in an array, I parse the CSV file in 20 seconds...). Is there a way to bypass this problem?
NB: I already put
ini_set('memory_limit', '-1');
but this line just bypasses the memory error.
Saving all of that CSV data will definitely take its toll on memory.
One logical solution to your problem is to have a database that stores all of that data.
You may refer to this link for a tutorial on parsing your CSV file and storing it in a database.
Write the processed data (for each line separately) into one file (or database):
file_put_contents('data.txt', $parsingresult, FILE_APPEND);
FILE_APPEND will append $parsingresult at the end of the file's content.
Then you can access the processed data with file_get_contents() or file().
Anyway, I think using a database and some pre-processing would be the best solution if this is needed more often.
You can use fgetcsv() to read and parse the CSV file one line at a time. Keep the data you need and discard the line:
// Store the useful data here
$data = array();
// Open the CSV file
$fh = fopen('data.csv', 'r');
// The first line probably contains the column names
$header = fgetcsv($fh);
// Read and parse one data line at a time
while ($row = fgetcsv($fh)) {
// Get the desired columns from $row
// Use $header if the order or number of columns is not known in advance
// Store the gathered info into $data
}
// Close the CSV file
fclose($fh);
This way it uses the minimum amount of memory needed to parse the CSV file.
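Applied to your DNS/cipher question, a sketch of this approach could aggregate while reading instead of keeping one object per line, so memory grows with the number of distinct DNS names rather than with the number of lines. The file name and column positions below are assumptions; adjust them to your real layout.

// Assumed layout: column 0 = IP address, 1 = DNS name, 2 = cipher suite.
$ciphersPerDns = [];

$fh = fopen('scan_results.csv', 'r'); // placeholder file name
while (($row = fgetcsv($fh)) !== false) {
    $dns    = $row[1];
    $cipher = $row[2];
    // Using the cipher as an array key deduplicates it for free.
    $ciphersPerDns[$dns][$cipher] = true;
}
fclose($fh);

// Count the DNS names that use more than one cipher suite configuration.
$res = 0;
foreach ($ciphersPerDns as $ciphers) {
    if (count($ciphers) > 1) {
        $res++;
    }
}
echo $res;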
I have a CSV file with 104 fields, but I need only 4 of them to use in a MySQL database. Each file has about a million rows.
Could somebody tell me an efficient way to do this? Reading each line into an array takes a long time.
Thanks
You have to read every line in its entirety by definition. This is necessary to find the delimiter for the next record (i.e. the newline character). You only need to discard the data you have read that you don't need. E.g.:
$data = array();
$fh = fopen('data.csv', 'r');
$headers = fgetcsv($fh);
while ($row = fgetcsv($fh)) {
$row = array_combine($headers, $row);
$data[] = array_intersect_key($row, array_flip(array('foo', 'bar', 'baz')));
// alternatively, if you know the column index, something like:
// $data[] = array($row[1], $row[45], $row[60]);
}
This only retains the columns foo, bar and baz and discards the rest. Reading from the file (fgetcsv) is about as fast as it gets. If you need it to be any faster, you'll have to implement your own CSV tokenizer and parser which skips over the columns you don't need without even temporarily storing them in memory; how much of a performance boost this brings vs. the development time necessary to implement it bug-free is very debatable.
A simple Excel macro can drop all unnecessary columns (100 out of 104) within a second. I am looking for a similar solution.
That is because Excel, once a file is opened, has all the data in memory and can act on it very quickly. For an accurate comparison you need to compare the time it takes to open the file in Excel plus dropping the columns, not just dropping the columns.
I need to export a MySQL table to .xls format. This is a fragment of my code:
$result = mysql_query( /* here query */ );
$objPHPExcel = new PHPExcel();
$rowNumber = 1;
while ($row = mysql_fetch_row($result)) {
$col = 'A';
foreach($row as $cell) {
$objPHPExcel->getActiveSheet()->setCellValue($col.$rowNumber,$cell);
$col++;
}
$rowNumber++;
}
The problem is that the table has 500,000 rows, and because every iteration of the while cycle also runs a foreach cycle, the PHP script takes a very long time to execute.
Is it possible to optimize this code?
500,000 rows will always take a lot of time to write, even if you speed it up by using the worksheet's fromArray() method to get rid of your foreach loop; and (as nichar has pointed out) this is too many rows for the xls format to handle unless you split them across multiple worksheets.
You can reduce the memory requirements by enabling cell caching (SQLite gives the best memory usage), but it will still take a long time to execute for 500,000 rows, and anything this size should be run as a batch/cron job.
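For reference, here is a hedged sketch of both suggestions combined, using the same PHPExcel API as your fragment (check the cache constant against your PHPExcel version): enable SQLite-backed cell caching before creating the workbook, and write each row with fromArray() instead of cell by cell.

// Enable cell caching before the PHPExcel object is instantiated
$cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_to_sqlite3;
PHPExcel_Settings::setCacheStorageMethod($cacheMethod);

$objPHPExcel = new PHPExcel();
$sheet = $objPHPExcel->getActiveSheet();

$rowNumber = 1;
while ($row = mysql_fetch_row($result)) {
    // Write the whole row in one call, starting at column A of the current row
    $sheet->fromArray($row, null, 'A' . $rowNumber);
    $rowNumber++;
}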
This is a point to note rather than a direct answer to your question, but if the Excel file format you're outputting is .xls, the maximum number of rows is 65,536; if it is the MS Excel 2007+ format, e.g. .xlsx, the maximum number of rows is 1,048,576.
So without changing the output format to .xlsx (which is an entirely different structure), the files will be too large to open.
Consider dumping the data into a CSV file and then importing it into Excel. It should be a lot faster.
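A minimal sketch of that CSV approach, sticking with the mysql_* functions from your fragment (the output file name is just a placeholder):

$result = mysql_query( /* here query */ );

$fh = fopen('export.csv', 'w'); // placeholder file name
while ($row = mysql_fetch_row($result)) {
    fputcsv($fh, $row); // streams one row at a time; nothing is kept in memory
}
fclose($fh);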
If you get a PHP timeout, you can reset the limit by adding this inside the while or for loop:
set_time_limit(300); //whatever seconds you want
If you're running it through the browser, your server may be timing out. I recommend you run it on the command line to avoid this.
Also, similar to what nickhar mentioned, it can be an Excel issue. I would try outputting a CSV file; I think it will allow you to output more lines.
I am using PHP to expose vehicle GPS data from a CSV file. This data is captured at least every 30 seconds for over 70 vehicles and includes 19 columns of data. This produces several thousand rows of data and file sizes around 614 KB. New data is appended to the end of the file. I need to pull out the last row of data for each vehicle, which should represent its most recent status. I am able to pull out one row for each unit; however, since the CSV file is in chronological order, I am pulling out the oldest data in the file instead of the newest. Is it possible to read the CSV from the end to the beginning? I have seen some solutions, however they typically involve loading the entire file into memory and then reversing it, which sounds very inefficient. Do I have any other options? Thank you for any advice you can offer.
EDIT: I am using this data to map real-time locations on the fly. The data is only provided to me in CSV format, so I think importing it into a DB is out of the question.
With fseek you can set the pointer to the end of the file and use a negative offset to read the file backwards.
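As a rough sketch of that idea (the helper name and file name are made up for the example): walk backwards from the end of the file one byte at a time until a newline is found, so only the last line is ever held in memory.

// Hypothetical helper: returns the last line of a file without loading it all.
function readLastLine($path)
{
    $fh = fopen($path, 'r');
    $pos = -1;
    $line = '';
    // Step backwards one byte at a time until the previous newline
    // (or the start of the file) is reached.
    while (fseek($fh, $pos, SEEK_END) !== -1) {
        $char = fgetc($fh);
        if ($char === "\n" && $line !== '') {
            break;
        }
        $line = $char . $line;
        $pos--;
    }
    fclose($fh);
    return trim($line);
}

$lastRow = str_getcsv(readLastLine('gps_data.csv')); // placeholder file name

Since you need the latest row for each of the ~70 vehicles rather than the overall last line, you would keep scanning backwards with the same byte-by-byte technique until every vehicle has been seen once.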
If you must use CSV files instead of a database, then perhaps you could read the file line by line. This will prevent more than the last line being stored in memory (thanks to the garbage collector).
$handle = #fopen("/path/to/yourfile.csv", "r");
if ($handle) {
while (($line = fgets($handle)) !== false) {
// old values of $last are garbage collected after re-assignment
$last = $line;
// you can perform optional computations on past data here if desired
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
// $last will now contain the last line of the file.
// You may now do whatever with it
}
Edit: I did not see the fseek() post. If all you need is the last line, then that is the way to go.