I have a requirement to build a report as an XLSX file; the report may contain 10,000 to 1,000,000 transaction rows. I decided to use PhpSpreadsheet from https://phpspreadsheet.readthedocs.io/en/latest/
The problem is that it takes far too long to write 10,000 rows, each record consisting of 50 columns. The script has been running for nearly 24 hours and the progress is 2300/10000. Here is my code:
<?php
require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\Spreadsheet;

$client = new \Redis();
$client->connect('192.168.7.147', 6379);
$pool = new \Cache\Adapter\Redis\RedisCachePool($client);
$simpleCache = new \Cache\Bridge\SimpleCache\SimpleCacheBridge($pool);
\PhpOffice\PhpSpreadsheet\Settings::setCache($simpleCache);

$process_time = microtime(true);

if (!file_exists('test.xlsx')) {
    $spreadsheet = new Spreadsheet();
    $writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
    $writer->save("test.xlsx");
    unset($writer);
}

for ($r = 1; $r <= 10000; $r++) {
    $reader = new \PhpOffice\PhpSpreadsheet\Reader\Xlsx();
    $spreadsheet = $reader->load("test.xlsx");
    $rowArray = [];
    for ($c = 1; $c <= 50; $c++) {
        $rowArray[] = $r . ".Content " . $c;
    }
    $spreadsheet->getActiveSheet()->fromArray(
        $rowArray,
        NULL,
        'A' . $r
    );
    $writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
    $writer->save("test.xlsx");
    unset($reader);
    unset($writer);
    $spreadsheet->disconnectWorksheets();
    unset($spreadsheet);
}

$process_time = microtime(true) - $process_time;
echo $process_time . "\n";
Notes:
I proposed a CSV file, but the client only wants XLSX.
Without the Redis cache it gives a memory error even with fewer than 400 records.
I don't intend to read the .xlsx with PHP, only to write to it, but the library appears to read the entire spreadsheet anyway.
In the example above the file is opened and closed for every single record; when I do open -> write everything -> close, I get a memory error midway through.
I see you are opening (createReader) and saving (createWriter) the file on every iteration of the loop, which is likely what is slowing the process down. Since your logic ultimately writes everything back to the same file, you can just open it once, write all 50 x 10k records, then close and save.
A quick test with your code rearranged as follows finished in approximately 25 seconds on my local XAMPP install on Windows. I'm not sure whether that meets your requirement, and it may take longer if the content contains long strings, but on a more powerful server the timing should improve significantly.
$process_time = microtime(true);

// $file_loc, $target_dir and $file_name are local path variables
$reader = new \PhpOffice\PhpSpreadsheet\Reader\Xlsx();
$spreadsheet = $reader->load($file_loc);

$row_count = 10000;
$col_count = 50;

for ($r = 1; $r <= $row_count; $r++) {
    $rowArray = [];
    for ($c = 1; $c <= $col_count; $c++) {
        $rowArray[] = $r . ".Content " . $c;
    }
    $spreadsheet->getActiveSheet()->fromArray(
        $rowArray,
        NULL,
        'A' . $r
    );
}

$writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
$writer->save($target_dir . 'result_' . $file_name);

unset($reader);
unset($writer);
$spreadsheet->disconnectWorksheets();
unset($spreadsheet);

$process_time = microtime(true) - $process_time;
echo $process_time . "\n";
Edited:

"Without the Redis cache it gives a memory error even with fewer than 400 records."

My quick test ran without any cache settings. My guess for the memory issue is that you are opening the XLSX file every time you write a row and then saving it back to the original file.
Every time you open the XLSX file, all the PhpSpreadsheet object info is loaded into memory along with the previous content plus the 50 new columns added on each save, so the amount of data handled grows with every iteration and the total work is quadratic in the number of rows. Eventually memory is freed more slowly than it is consumed, and you hit memory errors.
1st open and save
-> open: nothing yet
-> save: row A, 50 cols

2nd open and save
-> open: row A, 50 cols
-> save: rows A-B, 50 cols each

3rd open and save
-> open: rows A-B, 50 cols each
-> save: rows A-C, 50 cols each

...and so on. Memory may also still be holding the previously loaded data and not releasing it quickly (I have no idea how the server handles the memory), and eventually it runs out.
Related
I am making a Covid-19 statistics website - https://e-server24.eu/ . Every time somebody visits the website, the PHP script decodes JSON from 3 URLs and stores the data in some variables.
I want to make my website more optimized, so my question is: is there any way to update the variables' data once per day, rather than every time someone accesses the website?
Thanks,
I suggest looking into memory object caching.
Many high-performance PHP web apps use caching extensions (e.g. Memcached, APCu, WinCache), accelerators (e.g. APC, Varnish) and caching DBs like Redis. The setup can be a bit involved, but you can get started with a simple roll-your-own solution (inspired by this):
<?php
function cache_set($key, $val) {
    $val = var_export($val, true);
    // HHVM fails at __set_state, so just use object cast for now
    $val = str_replace('stdClass::__set_state', '(object)', $val);
    // Write to temp file first to ensure atomicity
    $tmp = sys_get_temp_dir() . "/$key." . uniqid('', true) . '.tmp';
    file_put_contents($tmp, '<?php $val = ' . $val . ';', LOCK_EX);
    rename($tmp, sys_get_temp_dir() . "/$key");
}

function cache_get($key) {
    // The included file defines $val; @ suppresses the warning on a cache miss
    @include sys_get_temp_dir() . "/$key";
    return isset($val) ? $val : false;
}
$ttl_hours = 24;
$now = new DateTime();

// Get results from cache if possible; otherwise, retrieve them.
$data = cache_get('my_key');
$last_change = cache_get('my_key_last_mod');

// Note: DateInterval::$h only holds the 0-23 hour component, so whole days
// must be counted as well or a 24-hour TTL would never expire.
if ($data === false || $last_change === false
    || ($now->diff($last_change)->days * 24 + $now->diff($last_change)->h) >= $ttl_hours) {
    // Expensive call to get the actual data; we simply create an object to demonstrate the concept
    $myObj = new stdClass();
    $myObj->name = "John";
    $myObj->age = 30;
    $myObj->city = "New York";
    $data = json_encode($myObj);
    // Add to user cache
    cache_set('my_key', $data);
    $last_change = new DateTime(); // now
    // Add timestamp to user cache
    cache_set('my_key_last_mod', $last_change);
}
echo $data;
Voila.
Furthermore, you could look into client-side caching and many other things, but this should give you an idea.
PS: Most memory cache systems allow you to define a time-to-live (TTL), which makes this more concise, but I wanted to keep the example dependency-free. Cache cleaning was omitted here; simply delete the temp files.
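For comparison, here is a minimal sketch of the same pattern using the APCu extension (an assumption: it requires apcu to be installed and enabled); the built-in TTL replaces the manual timestamp bookkeeping above:

<?php
// Minimal APCu sketch; assumes the apcu extension is available
$data = apcu_fetch('my_key', $hit);
if (!$hit) {
    // The expensive call goes here; we just build a demo payload again
    $data = json_encode(['name' => 'John', 'age' => 30, 'city' => 'New York']);
    apcu_store('my_key', $data, 24 * 3600); // TTL of 24 hours
}
echo $data;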
A simple way to do that:
Create a script which fetches and decodes the JSON data and stores it in your database.
Then set up a cron job that runs it every 24 hours.
When a user visits your site, fetch the data from your database instead of from your API provider. A sketch follows.
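A hedged sketch of what that daily script might look like (the URL, DSN, credentials and table name are all placeholders), run from a crontab entry such as 0 3 * * * php /path/to/fetch_stats.php:

<?php
// fetch_stats.php - hypothetical daily job; URL, DSN and table are placeholders
$json = file_get_contents('https://example.com/stats.json');
$data = json_decode($json, true);

$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
// Keep one row per day, overwriting today's row if the job runs twice
$stmt = $pdo->prepare('REPLACE INTO daily_stats (day, payload) VALUES (CURDATE(), ?)');
$stmt->execute([json_encode($data)]);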
I've written a script to get all image records from a database, use each image's tempname to find the image on disk, copy it to a new folder and create a tar file out of these files. To do so, I'm using PHP's PharData. The problem is that the images are TIFF files, and pretty large ones at that (the entire folder of 2000-ish images is 95 GB in size).
Initially I created one archive, looped through all database records to find each specific file and used PharData::addFile() to add each file to the archive individually, but this eventually led to a single file taking 15+ seconds to add.
I've now switched to using PharData::buildFromDirectory() in batches, which is significantly faster, but the time to create each batch increases with every batch. The first batch was done in 28 seconds, the second in 110, and the third one didn't even finish. Code:
$imageLocation = '/path/to/imagefolder';
$copyLocation = '/path/to/backupfolder';
$images = [images];

$archive = new PharData($copyLocation . '/images.tar');

// Set timeout to an hour (adding 2000 files to a tar archive takes time apparently), should be enough
set_time_limit(3600);

$time = microtime(true);

$inCurrentBatch = 0;
$perBatch = 100; // Amount of files to be included in one archive file
$archiveNumber = 1;

foreach ($images as $image) {
    $path = $imageLocation . '/' . $image->getTempname();

    // If the file exists, copy to folder with proper file name
    if (file_exists($path)) {
        $copyName = $image->getFilename() . '.tiff';
        $copyPath = $copyLocation . '/' . $copyName;
        copy($path, $copyPath);
        $inCurrentBatch++;

        // If the current batch reached the limit, add all files to the archive and remove the .tiff files
        if ($inCurrentBatch === $perBatch) {
            $archive = new PharData($copyLocation . "/images_{$archiveNumber}.tar");
            $archive->buildFromDirectory($copyLocation);
            array_map('unlink', glob("{$copyLocation}/*.tiff"));
            $inCurrentBatch = 0;
            $archiveNumber++;
        }
    }
}

// Archive any leftover files in a last archive
if (glob("{$copyLocation}/*.tiff")) {
    $archive = new PharData($copyLocation . "/images_{$archiveNumber}.tar");
    $archive->buildFromDirectory($copyLocation);
    array_map('unlink', glob("{$copyLocation}/*.tiff"));
}

$taken = microtime(true) - $time;
echo "Done in {$taken} seconds\n";
exit(0);
The copied images get removed between batches to save disk space.
We're fine with the entire script taking a while, but I don't understand why the time to create an archive increases so much between batches.
I need to write about 111,100 rows to an .xlsx file using fromArray(), but I get a strange error.
I'm using the PhpSpreadsheet library.
$spreadsheet = new Spreadsheet();
$sheet = $spreadsheet->getActiveSheet();

$columnLetter = 'A';
foreach ($columnNames as $columnName) {
    // Allows access to the AA column and beyond if needed
    $sheet->setCellValue($columnLetter . '1', $columnName);
    $columnLetter++;
}

$i = 2; // Beginning row for active sheet
$columnLetter = 'A';
foreach ($columnValues as $columnValue) {
    $sheet->fromArray(array_values($columnValue), NULL, $columnLetter . $i);
    $i++;
    $columnLetter++;
}

// Create your Office 2007 Excel (XLSX Format)
$writer = new Xlsx($spreadsheet);

// In this case, we want to write the file in the public directory
// e.g. /var/www/project/public/my_first_excel_symfony4.xlsx
$excelFilepath = $directory . '/' . $filename . '.xlsx';

// Create the file
$writer->save($excelFilepath);
And I get the exception:

message: "Invalid cell coordinate AAAA18272"
#code: 0
#file: "./vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Cell/Coordinate.php"

Can you help me please?
Excel worksheets are limited. The limits are huge (1,048,576 rows by 16,384 columns, so the last valid column is XFD) but they are still limits, and the coordinate check you hit is correct: you can't write where there is no space. Note that your second loop increments $columnLetter on every row as well as $i, so the starting coordinate drifts diagonally; after enough rows the column letter has incremented past ZZZ to AAAA, which can never be a valid coordinate.
In any case you shouldn't use Excel sheets for such a big amount of data. You can try fragmenting it into smaller pieces, as sketched below, but databases are the way to manipulate that amount of information.
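A minimal sketch of the fragmenting idea (my own illustration, not the asker's code; $allRows stands in for the full data set): start a new worksheet every $rowsPerSheet rows so no single sheet approaches the row cap.

use PhpOffice\PhpSpreadsheet\Spreadsheet;

$rowsPerSheet = 50000; // arbitrary chunk size, well under the 1,048,576-row limit
$spreadsheet = new Spreadsheet();
$sheet = $spreadsheet->getActiveSheet();
$rowInSheet = 1;
foreach ($allRows as $rowValues) {
    if ($rowInSheet > $rowsPerSheet) {
        // Current sheet is full: continue on a fresh one
        $sheet = $spreadsheet->createSheet();
        $rowInSheet = 1;
    }
    $sheet->fromArray($rowValues, NULL, 'A' . $rowInSheet);
    $rowInSheet++;
}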
I have a file with a size of around 10 GB or more. The file contains only numbers ranging from 1 to 10, one per line, and nothing else. The task is to read the numbers from the file, sort them in ascending or descending order, and create a new file with the sorted numbers.
Can anyone please help me with this?
I'm assuming this is some kind of homework and the goal is to sort more data than you can hold in your RAM?
Since you only have the numbers 1-10, this is not that complicated a task; it is a counting sort. Just open your input file and count how many occurrences of each specific number you have. After that you can construct a simple loop and write the values into another file. The following example is pretty self-explanatory.
$inFile = '/path/to/input/file';
$outFile = '/path/to/output/file';

$input = fopen($inFile, 'r');
if ($input === false) {
    throw new Exception('Unable to open: ' . $inFile);
}

// $map will be an array of size 10, filled with 0-s
$map = array_fill(1, 10, 0);

// Read the file line by line and count how many of each specific number you have
while (($line = fgets($input)) !== false) {
    $int = (int) $line;
    $map[$int]++;
}
fclose($input);

$output = fopen($outFile, 'w');
if ($output === false) {
    throw new Exception('Unable to open: ' . $outFile);
}

/*
 * Reverse the array (preserving keys) if you need to change direction
 * between ascending and descending order
 */
//$map = array_reverse($map, true);

// Write the values into your output file
foreach ($map as $number => $count) {
    $string = ((string) $number) . PHP_EOL;
    for ($i = 0; $i < $count; $i++) {
        fwrite($output, $string);
    }
}
fclose($output);
Given that you are dealing with huge files, you should also check the script execution time limit for your PHP environment; the example above will take a VERY long time for 10 GB+ files. Since I didn't see any limitations concerning execution time and performance in your question, I'm assuming that is OK.
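If your environment does enforce a limit, you can lift it at the top of the script (this works on the CLI and on servers where the setting is not locked down):

set_time_limit(0); // 0 = no execution time limit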
I had a similar issue before. Trying to manipulate such a large file ended up being a huge drain on resources, and the script couldn't cope. The easiest solution I ended up with was to import the file into a MySQL database using the fast bulk-load statement LOAD DATA INFILE:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Once it's in, you should be able to manipulate the data; a rough sketch follows.
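A hypothetical sketch of the idea (DSN, credentials and table name are placeholders, and the server must allow LOCAL INFILE):

<?php
// Bulk-load the numbers file into a one-column table, then read it back sorted
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);
$pdo->exec('CREATE TABLE IF NOT EXISTS numbers (n TINYINT NOT NULL)');
$pdo->exec("LOAD DATA LOCAL INFILE '/path/to/input/file' INTO TABLE numbers");
foreach ($pdo->query('SELECT n FROM numbers ORDER BY n ASC') as $row) {
    echo $row['n'], PHP_EOL;
}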
Alternatively, you could just read the file line by line while outputting the result into another file line by line with the sorted numbers. I'm not too sure how well that would work, though.
Have you made any previous attempts at this, or are you just after a possible method of doing it?
If that's all, you don't need PHP (if you have a Linux machine at hand):
sort -n file > file_sorted-asc
sort -nr file > file_sorted-desc
Edit: OK, here's your solution in PHP (if you have a Linux machine at hand):
<?php
// Sort ascending
`sort -n file > file_sorted-asc`;
// Sort descending
`sort -nr file > file_sorted-desc`;
?>
:)
I'm using PHPExcel 1.7.8, PHP 5.4.14, Windows 7, and an Excel 2007 spreadsheet. The spreadsheet consists of 750 rows, columns A through BW, and is about 600 KB in size. This is my code for opening the spreadsheet - pretty standard:
// Include PHPExcel_IOFactory
include 'PHPExcel/IOFactory.php';
include 'PHPExcel.php';

$inputFileName = 'C:\xls\lspimport\GetLSP1.xlsx';

// Read your Excel workbook
try {
    $inputFileType = PHPExcel_IOFactory::identify($inputFileName);
    $objReader = PHPExcel_IOFactory::createReader($inputFileType);
    $objReader->setReadDataOnly(true);
    $objPHPExcel = $objReader->load($inputFileName);
} catch (Exception $e) {
    die('Error loading file "' . pathinfo($inputFileName, PATHINFO_BASENAME) . '": ' . $e->getMessage());
}

// Set active worksheet
$objWorksheet = $objPHPExcel->setActiveSheetIndexbyName('Sheet1');

$j = 0;
for ($i = 2; $i < 3; $i++)
{
    ...
}
In the end I want to loop through each row in the spreadsheet, but for the time being, while I perfect the script, I'm only looping through one row. The problem is that it takes 30 minutes for this script to execute. I echoed messages after each section of code so I could see what was being processed and when, and my script basically waits for 30 minutes at this line:
$objPHPExcel = $objReader->load($inputFileName);
Have I configured something incorrectly? I can't figure out why it takes 30 minutes to load the spreadsheet. I appreciate any and all help.
PHPExcel has a problem identifying where the end of your Excel file is. Or rather, Excel has a hard time knowing where its own end is: if a cell at A1000000 was ever touched, it thinks it needs to read that far.
I have done 2 things in the past to fix this:
1) Cut and paste the data you need into a new Excel file.
2) Specify the exact dimensions you want to read.
Edit: how to do option 2:
public function readExcelDataToArray($excelFilePath, $maxRowNumber = -1, $maxColumnNumber = -1)
{
    $objPHPExcel = PHPExcel_IOFactory::load($excelFilePath);
    $objWorksheet = $objPHPExcel->getActiveSheet();

    // Get the last row and column that have data
    if ($maxRowNumber == -1) {
        $lastRow = $objWorksheet->getHighestDataRow();
    } else {
        $lastRow = $maxRowNumber;
    }
    if ($maxColumnNumber == -1) {
        $lastCol = $objWorksheet->getHighestDataColumn();
        // Change the column letter to a column number
        // (minus 1, because getCellByColumnAndRow expects a 0-based column)
        $lastCol = PHPExcel_Cell::columnIndexFromString($lastCol) - 1;
    } else {
        $lastCol = $maxColumnNumber;
    }

    // Get data array
    $dataArray = array();
    for ($currentRow = 1; $currentRow <= $lastRow; $currentRow++) {
        for ($currentCol = 0; $currentCol <= $lastCol; $currentCol++) {
            $dataArray[$currentRow][$currentCol] = $objWorksheet->getCellByColumnAndRow($currentCol, $currentRow)->getValue();
        }
    }
    return $dataArray;
}
Unfortunately, these solutions aren't very dynamic.
Note that a modern Excel file is really just a zip archive with an .xlsx extension. I have written extensions to PHPExcel that unzip them and modify certain XML files to get the kinds of behavior I want.
A third suggestion would be to monitor the contents of each row and stop when you hit an empty one. (A sketch of option 2 using PHPExcel's read-filter API follows.)
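For completeness, a minimal sketch of option 2 using PHPExcel's read-filter API; the 750-row and BW-column bounds are taken from the question above and would need adjusting for other files:

// Limit reading to rows 1-750 and columns A through BW
class RangeReadFilter implements PHPExcel_Reader_IReadFilter
{
    public function readCell($column, $row, $worksheetName = '')
    {
        return $row <= 750
            && PHPExcel_Cell::columnIndexFromString($column) <= PHPExcel_Cell::columnIndexFromString('BW');
    }
}

$objReader = PHPExcel_IOFactory::createReader('Excel2007');
$objReader->setReadFilter(new RangeReadFilter());
$objPHPExcel = $objReader->load($inputFileName);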
Resolved (for me) - see the note at the bottom of this post.
I'm trying to use pretty much identical code on a dedicated quad-core server with 16 GB of RAM, also running similar versions - PHPExcel 1.7.9 and PHP 5.4.16.
Just creating an empty reader takes 50 seconds!
// $inputFileType is 'Excel5';
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
Loading the spreadsheet I want to process read-only (1 sheet, 2000 rows, 25 columns) then takes 1802 seconds.
$objReader->setReadDataOnly(true);
$objPHPExcel = $objReader->load($inputFileName);
For the various reader types I consistently get the instantiation timings shown below:
foreach (array(
    'Excel2007',    // 350 seconds
    'Excel5',       // 50 seconds
    'Excel2003XML', // 50 seconds
    'OOCalc',       // 50 seconds
    'SYLK',         // 50 seconds
    'Gnumeric',     // 50 seconds
    'HTML',         // 250 seconds
    'CSV'           // 50 seconds
) as $inputFileType) {
    $objReader = PHPExcel_IOFactory::createReader($inputFileType);
}
Peak memory usage was about 8 MB, far less than the 250 MB the script has available to it.
My suspicion WAS that PHPExcel_IOFactory::createReader($inputFileType) was calling something within a loop that is extremely slow under PHP 5.4.x.
However, the excessive time was due to how PHPExcel names its classes and lays out the corresponding files. It has an autoloader that converts class names such as PHPExcel_abc_def into PHPExcel/abc/def.php for the require statement. Although we had PHPExcel's class directory defined in our include path, our own (already registered) autoloader couldn't handle the conversion from class name to file name (it was looking for PHPExcel_abc_def.php). When a class file cannot be included, our autoloader loops 5 times with a 10-second delay to see if the file is being updated and might become available. So for every PHPExcel class that needed to be loaded, we were introducing a delay of 50 seconds before hitting PHPExcel's own autoloader, which required the file just fine (see the sketch below).
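Roughly what that class-to-file mapping looks like as an autoloader (a sketch of the behaviour described above, not PHPExcel's actual code):

spl_autoload_register(function ($className) {
    // Only handle PHPExcel classes; leave everything else to other autoloaders
    if (strpos($className, 'PHPExcel') !== 0) {
        return;
    }
    // PHPExcel_abc_def -> PHPExcel/abc/def.php, resolved via the include path
    $file = str_replace('_', '/', $className) . '.php';
    if (stream_resolve_include_path($file) !== false) {
        require_once $file;
    }
});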
Now that I've got that resolved, PHPExcel is proving to be truly awesome.
I'm using the latest version of PHPExcel (1.8.1) in a Symfony project, and I also ran into time delays when using the $objReader->load($file) method. The delays were not due to an autoloader but to the load method itself, which actually reads every cell in every worksheet. Since my data worksheet was 30 columns wide by 5000 rows, it took about 90 seconds to read all this information on my ancient work computer.
I assumed that the real loading/reading of cell values would occur on the fly as I requested them, but it looks like, short of a pretty major rewrite of the PHPExcel code, there is no real way around this initial load-time delay.
If you know your file is a pretty plain Excel file, you can do the reading manually. An .xlsx file is just a zip archive with the spreadsheet values and structure stored in XML files. This script took me from the 60 seconds used by PHPExcel down to 0.18 seconds:
$zip = new ZipArchive();
$zip->open('path_to/file.xlsx');

$sheet_xml = simplexml_load_string($zip->getFromName('xl/worksheets/sheet1.xml'));
$sheet_array = json_decode(json_encode($sheet_xml), true);

$values = simplexml_load_string($zip->getFromName('xl/sharedStrings.xml'));
$values_array = json_decode(json_encode($values), true);

$end_result = array();
if ($sheet_array['sheetData']) {
    foreach ($sheet_array['sheetData']['row'] as $r => $row) {
        $end_result[$r] = array();
        foreach ($row['c'] as $c => $cell) {
            if (isset($cell['@attributes']['t'])) {
                if ($cell['@attributes']['t'] == 's') {
                    // Shared string: look the value up in sharedStrings.xml
                    $end_result[$r][] = $values_array['si'][$cell['v']]['t'];
                } else if ($cell['@attributes']['t'] == 'e') {
                    // Error cell
                    $end_result[$r][] = '';
                }
            } else {
                // Plain (numeric) value
                $end_result[$r][] = $cell['v'];
            }
        }
    }
}
Result:
Array
(
[0] => Array
(
[0] => A1
[1] => B1
[2] => C1
)
[1] => Array
(
[0] => A2
[1] => B2
[2] => C2
)
)
This is error-prone and not optimized, but it works and illustrates the basic idea. If you know your file, you can make reading very fast. If you allow users to upload the files, you should maybe avoid this approach, or at least do the necessary checks.