Creating multiple tar files in batches: duration increases with every batch - PHP

I've written a script that fetches all image records from a database, uses each image's tempname to find the image on disk, copies it to a new folder and creates a tar file out of these files. To do so, I'm using PHP's PharData. The problem is that the images are TIFF files, and pretty large ones at that (the entire folder of roughly 2000 images is 95 GB in size).
Initially I created one archive, looped through all database records to find each specific file and used PharData::addFile() to add each file to the archive individually, but this eventually led to a single file taking 15+ seconds to add.
I've now switched to using PharData::buildFromDirectory() in batches, which is significantly faster, but the time to create a batch increases with each batch. The first batch was done in 28 seconds, the second in 110, and the third one didn't even finish. Code:
$imageLocation = '/path/to/imagefolder';
$copyLocation = '/path/to/backupfolder';

$images = [images];

$archive = new PharData($copyLocation . '/images.tar');

// Set timeout to an hour (adding 2000 files to a tar archive takes time apparently), should be enough
set_time_limit(3600);

$time = microtime(true);

$inCurrentBatch = 0;
$perBatch = 100; // Amount of files to be included in one archive file
$archiveNumber = 1;

foreach ($images as $image) {
    $path = $imageLocation . '/' . $image->getTempname();

    // If the file exists, copy to folder with proper file name
    if (file_exists($path)) {
        $copyName = $image->getFilename() . '.tiff';
        $copyPath = $copyLocation . '/' . $copyName;

        copy($path, $copyPath);

        $inCurrentBatch++;

        // If the current batch reached the limit, add all files to the archive and remove the .tiff files
        if ($inCurrentBatch === $perBatch) {
            $archive = new PharData($copyLocation . "/images_{$archiveNumber}.tar");
            $archive->buildFromDirectory($copyLocation);

            array_map('unlink', glob("{$copyLocation}/*.tiff"));

            $inCurrentBatch = 0;
            $archiveNumber++;
        }
    }
}

// Archive any leftover files in a last archive
if (glob("{$copyLocation}/*.tiff")) {
    $archive = new PharData($copyLocation . "/images_{$archiveNumber}.tar");
    $archive->buildFromDirectory($copyLocation);

    array_map('unlink', glob("{$copyLocation}/*.tiff"));
}

$taken = microtime(true) - $time;
echo "Done in {$taken} seconds\n";
exit(0);
The copied images get removed between batches to save disk space.
We're fine with the entire script taking a while, but I don't understand why the time to create an archive increases so much from batch to batch.
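One detail that may explain the growth (an observation on my part, not something confirmed in the post): every images_{$archiveNumber}.tar is written into $copyLocation itself, and PharData::buildFromDirectory() by default picks up every file in that directory, which would include the archives created by earlier batches. A minimal sketch that limits each batch to the copied .tiff files, using the optional regex pattern parameter of buildFromDirectory():

$archive = new PharData($copyLocation . "/images_{$archiveNumber}.tar");
// Only include the freshly copied .tiff files, not the tar archives from earlier batches.
$archive->buildFromDirectory($copyLocation, '/\.tiff$/');

Copying each batch into a dedicated subfolder and building from there would have the same effect.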

Related

Remove Prestashop orphan images not stored in DB

I need to clean a shop that has been running Prestashop (currently 1.7) for many years.
With this script I removed all the images in the DB that are not connected to any product.
But there are many files not listed in the DB. For example, I currently have 5 image sizes in the settings, so a new product shows 6 files in the folder (the 5 sizes plus the imageID.jpg file), but some old products had up to 18 files. Many of these old products have been deleted, but in the folder I still find all the other formats, like "2026-small-cart.jpg".
So I tried creating a script that loops through the folders, checks the image files in them and verifies whether that id_image is stored in the DB.
If not, I can delete the file.
It works, but obviously the loop is huge and it stops working as soon as I move the starting path higher up the folder tree.
I've tried to reduce the DB queries by storing some data (to delete all the images with the same id with a single DB query), but it still crashes as soon as I change the starting path.
It only works with two nested loops (really few...).
Here is the code. Any idea for a better way to get the result?
Thanks!
$shop_root = $_SERVER['DOCUMENT_ROOT'].'/';

include('./config/config.inc.php');
include('./init.php');

$image_folder = 'img/p/';
$image_folder = 'img/p/2/0/3/2/'; // TEST, existing product
$image_folder = 'img/p/2/0/2/6/'; // TEST, product deleted from DB but files in folder
//$image_folder = 'img/p/2/0/2/'; // test, not working...

$scan_dir = $shop_root.$image_folder;

// will check only images...
global $imgExt;
$imgExt = array("jpg","png","gif","jpeg");

// to avoid multiple queries for the same image id...
global $lastID;
global $delMode;

echo "<h1>Examined folder: $image_folder</h1>\r\n";

function checkFile($scan_dir,$name) {
    global $lastID;
    global $delMode;

    $path = $scan_dir.$name;
    $ext = substr($name,strripos($name,".")+1);

    // if it is an image and the file name starts with a number
    if (in_array($ext,$imgExt) && (int)$name>0){
        // avoid extra queries...
        if ($lastID == (int)$name) {
            $inDb = $lastID;
        } else {
            $inDb = (int)Db::getInstance()->getValue('SELECT id_product FROM '._DB_PREFIX_.'image WHERE id_image ='.((int) $name));
            $lastID = (int)$name;
            $delMode = $inDb;
        }

        // if no id_product was found in the DB for that id_image
        if ($delMode<1){
            echo "- $path has no related product in the DB I'll DELETE IT<br>\r\n";
            //unlink($path);
        }
    }
}

function checkDir($scan_dir,$name2) {
    echo "<h3>Elements found in the folder <i>$scan_dir$name2</i>:</h3>\r\n";
    $files = array_values(array_diff(scandir($scan_dir.$name2.'/'), array('..', '.')));

    foreach ($files as $key => $name) {
        $path = $scan_dir.$name;
        if (is_dir($path)) {
            // new loop in the subfolder
            checkDir($scan_dir,$name);
        } else {
            // it is a file, check whether it must be deleted
            checkFile($scan_dir,$name);
        }
    }
}

checkDir($scan_dir,'');
I would create two files with lists of images.
The first file is the result of a query against your database for every image id referenced in your data.
mysql -BN -e "select distinct id_image from ${DB}.${DB_PREFIX}image" > all_image_ids
(set the shell variables for DB and DB_PREFIX first)
The second file is every image file currently in your directories. Include only files that start with a digit and have an image extension.
find img/p -type f \( -name '[0-9]*.jpg' -o -name '[0-9]*.jpeg' -o -name '[0-9]*.png' -o -name '[0-9]*.gif' \) > all_image_files
For each filename, check if it's in the list of image ids. If not, then output the command to delete the file.
cat all_image_files | while read filename ; do
    # strip the directory name and convert the filename to an integer value
    b=$(basename $filename)
    image_id=$((${b/.*/}))
    grep -q "^${image_id}$" all_image_ids || echo "rm ${filename}"
done > files_to_delete
Read the file files_to_delete to visually check that the list looks right. Then run that file as a shell script:
sh files_to_delete
Note I have not tested this solution, but it should give you something to experiment with.
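If you'd rather stay in PHP, the same single-query idea is possible there too: load every id_image into a lookup set once, then walk the image tree with SPL iterators instead of nested scandir() loops. This is only a sketch, assuming Prestashop's Db::getInstance()->executeS() is available as in the question's code; it applies the same check as the shell version above (does the id_image exist in the DB at all), and deletions are only echoed, as in the original script:

// Build a set of all image ids with one query.
$ids = array();
$rows = Db::getInstance()->executeS('SELECT id_image FROM '._DB_PREFIX_.'image');
foreach ($rows as $row) {
    $ids[(int)$row['id_image']] = true;
}

$imgExt = array('jpg', 'png', 'gif', 'jpeg');
$iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($scan_dir, FilesystemIterator::SKIP_DOTS)
);

foreach ($iterator as $file) {
    $ext = strtolower($file->getExtension());
    $id  = (int)$file->getFilename();
    // image file, name starts with a number, and no matching id_image in the DB
    if (in_array($ext, $imgExt) && $id > 0 && !isset($ids[$id])) {
        echo "- {$file->getPathname()} has no related product in the DB, I'd DELETE IT<br>\r\n";
        //unlink($file->getPathname());
    }
}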

PhpSpreadsheet: writing 10,000 records is too slow

I have a requirement to produce a report as an XLSX file; the report may contain 10,000-1,000,000 rows of transactions. I decided to use PhpSpreadsheet from https://phpspreadsheet.readthedocs.io/en/latest/
The problem is that it takes too long to write 10,000 rows, each consisting of 50 columns. After nearly 24 hours the script is still running and the progress is 2300/10000. Here is my code:
<?php

require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\Spreadsheet;

$client = new \Redis();
$client->connect('192.168.7.147', 6379);
$pool = new \Cache\Adapter\Redis\RedisCachePool($client);
$simpleCache = new \Cache\Bridge\SimpleCache\SimpleCacheBridge($pool);
\PhpOffice\PhpSpreadsheet\Settings::setCache($simpleCache);

$process_time = microtime(true);

if(!file_exists('test.xlsx')) {
    $spreadsheet = new Spreadsheet();
    $writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
    $writer->save("test.xlsx");
    unset($writer);
}

for($r=1;$r<=10000;$r++) {
    $reader = new \PhpOffice\PhpSpreadsheet\Reader\Xlsx();
    $spreadsheet = $reader->load("test.xlsx");

    $rowArray=[];
    for($c=1;$c<=50;$c++) {
        $rowArray[]=$r.".Content ".$c;
    }

    $spreadsheet->getActiveSheet()->fromArray(
        $rowArray,
        NULL,
        'A'.$r
    );

    $writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
    $writer->save("test.xlsx");

    unset($reader);
    unset($writer);
    $spreadsheet->disconnectWorksheets();
    unset($spreadsheet);
}

$process_time = microtime(true) - $process_time;
echo $process_time."\n";
Notes:
I proposed a CSV file, but the client only wants XLSX.
Without the Redis cache it gives a memory error even with fewer than 400 records.
I don't intend to read the .XLSX with PHP, only to write it, but it looks like the library reads the entire spreadsheet anyway.
In the example above the file is opened and closed for every single record; when I do open -> write all -> close, it shows a memory error midway through.
In the example above the file is opened and closed for every single record; when I do open -> write all -> close, it shows a memory error midway through.
I see that you are opening (createReader) and saving (createWriter) the file on every iteration of the loop while filling in the content. That is likely the cause of the slowdown. Since you eventually write the content back to the same file anyway, you can just open once > write all 50 x 10k records > close and save once.
A quick test with your code rearranged as follows finished in approximately 25 seconds on my local XAMPP install on Windows. I'm not sure whether this meets your requirements, and it may take longer if the content contains long strings. My guess is that on a more powerful server the performance would improve significantly.
$process_time = microtime(true);

$reader = new \PhpOffice\PhpSpreadsheet\Reader\Xlsx();
$spreadsheet = $reader->load($file_loc);

$row_count = 10000;
$col_count = 50;

for ($r = 1; $r <= $row_count; $r++) {
    $rowArray = [];
    for ($c = 1; $c <= $col_count; $c++) {
        $rowArray[] = $r . ".Content " . $c;
    }

    $spreadsheet->getActiveSheet()->fromArray(
        $rowArray,
        NULL,
        'A' . $r
    );
}

$writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
$writer->save($target_dir . 'result_' . $file_name);

unset($reader);
unset($writer);
$spreadsheet->disconnectWorksheets();
unset($spreadsheet);

$process_time = microtime(true) - $process_time;
echo $process_time."\n";
Edited:
Without the Redis cache it gives a memory error even with fewer than 400 records.
My quick test was done without any cache settings. My guess about the memory issue is that you are opening the XLSX file every time you write the content for a single row and then saving it back to the original file.
Every time you open the XLSX file, all of the PhpSpreadsheet object info is loaded into memory, together with all the previously written content plus the 50 new columns added on each save, so the work and memory use grow with every pass.
In the end the memory can't be cleared fast enough and you get memory errors.
1st open and save
-> open: nothing
-> save: row A, 50 cols
2nd open and save
-> open: row A, 50 cols
-> save: rows A-B, 50 cols each
3rd open and save
-> open: rows A-B, 50 cols each
-> save: rows A-C, 50 cols each
and so on and so forth...
The memory may also still be holding the previously loaded cache and not releasing it quickly (no idea how the server handles its memory), until it finally blows up.
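Since the original notes say the XLSX only needs to be written, not read back, another variant worth trying (my suggestion, not part of the answer above) is to skip the reader entirely and build the workbook from a fresh Spreadsheet object, saving once at the end:

// Sketch: write-only variant, no Reader involved.
$spreadsheet = new \PhpOffice\PhpSpreadsheet\Spreadsheet();
$sheet = $spreadsheet->getActiveSheet();

for ($r = 1; $r <= 10000; $r++) {
    $rowArray = [];
    for ($c = 1; $c <= 50; $c++) {
        $rowArray[] = $r . ".Content " . $c;
    }
    $sheet->fromArray($rowArray, NULL, 'A' . $r);
}

$writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
$writer->save('test.xlsx');

The cell data still lives in memory (or in the configured cell cache) until the single save, so a cache such as the Redis setup from the question may still be needed for very large reports.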

Get latest file in dir including subdirectories - PHP

I found out that I can use
$files = scandir('c:\myfolder', SCANDIR_SORT_DESCENDING);
$newest_file = $files[0];
to get the latest file in the given directory ('myfolder').
Is there an easy way to get the latest file including subdirectories?
Like:
myfolder> dir1 > file_older1.txt
myfolder> dir2 > dir3 > newest_file_in_maindir.txt
myfolder> dir4 > file_older2.txt
Thanks in advance
To the best of my knowledge, you have to recursively check every folder and file to get the last modified file. And your current solution doesn't find the last modified file; it sorts the files in descending order by name.
Meaning that a 10-year-old file named z.txt would probably end up on top.
I've cooked up a solution.
1. The function accepts a directory name and makes sure the directory exists. It returns null when neither the directory nor any of its subdirectories contains a file.
2. It sets aside the variables $latest and $latestTime, where the last modified file is stored.
3. It loops through the directory, skipping . and .. since they would cause an infinite recursion loop.
4. In the loop, the full filename is assembled from the initial directory name and the current entry.
5. The filename is checked to see whether it is a directory; if so, the same function is called recursively and the result is saved.
6. If that result is null, we continue the loop; otherwise we save the result as the new filename, which we now know is a file.
7. After that, we check the last modified time using filemtime() and see whether $latestTime is smaller, meaning the file stored so far was modified earlier than the current one.
8. If the new file is indeed newer, we save the new values to $latest and $latestTime, where $latest is the filename.
9. When the loop finishes, we return the result.
function find_last_modified_file(string $dir): ?string
{
    if (!is_dir($dir)) throw new \ValueError('Expecting a valid directory!');

    $latest = null;
    $latestTime = 0;

    foreach (scandir($dir) as $path) {
        if (in_array($path, ['.', '..'], true)) {
            continue;
        }

        $filename = $dir . DIRECTORY_SEPARATOR . $path;

        if (is_dir($filename)) {
            $directoryLastModifiedFile = find_last_modified_file($filename);
            if (null === $directoryLastModifiedFile) {
                continue;
            } else {
                $filename = $directoryLastModifiedFile;
            }
        }

        $lastModified = filemtime($filename);
        if ($lastModified > $latestTime) {
            $latestTime = $lastModified;
            $latest = $filename;
        }
    }

    return $latest;
}
echo find_last_modified_file(__DIR__);
In step 7 there is an edge case: if both files were modified at exactly the same time, it is up to you how to resolve it. I've opted to keep the file that was found first rather than updating it.
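For completeness, here is a more compact sketch of the same recursive idea using SPL iterators (my addition, not part of the answer above); it walks the whole tree and keeps the entry with the largest modification time:

// Sketch: same result using RecursiveDirectoryIterator instead of manual recursion.
function find_last_modified_file_spl(string $dir): ?string
{
    $latest = null;
    $latestTime = 0;

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $fileInfo) {
        if ($fileInfo->isFile() && $fileInfo->getMTime() > $latestTime) {
            $latestTime = $fileInfo->getMTime();
            $latest = $fileInfo->getPathname();
        }
    }

    return $latest;
}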

How can I detect a file change during a chunked upload?

I am doing a chunked file upload to a server, and I want to be able to detect whether the file has changed between uploads.
Suppose I send a file of 5 MB and the size of one chunk is 1 MB. Four chunks are sent to the server, and the last one doesn't make it because of a broken connection.
If the document is then changed at the beginning, so that its first chunk no longer matches the first chunk already on the server, the last chunk still gets uploaded, but the assembled contents no longer match the file.
To determine that one of the chunks has changed, I would have to re-send all the chunks to the server to compute the hash sum, but then the whole chunked upload loses its point.
How can I determine that a file has been modified without sending all the chunks to the server?
Additionally:
The upload works as follows:
First, the browser sends a request to create a new session for uploading files.
Request params:
part_size: 4767232
files: [
    {
        "file_size":4767232,
        "file_type":"application/msword",
        "file_name":"5 mb.doc"
    }
]
Next:
New records for the uploaded files are added to the database.
The server creates temporary folders for storing the chunks.
(The folder name is the GUID of the file record created in the database.)
The method returns the file GUIDs.
After receiving the GUIDs, the browser splits the files into chunks using the JavaScript Blob.slice() method and sends each chunk as a separate request, attaching the file identifier to the request.
The chunks are saved, and after the last chunk has been uploaded the file is assembled.
Code:
/**
 * @param $binary    The binary data.
 * @param $directory Current file's chunks directory path.
 */
private static function createChunk($binary, $directory)
{
    // Create a unique id for the chunk.
    $id = 'chunk_' . md5($binary);

    // Save the chunk to the folder with the rest of the chunks of this file.
    Storage::put($directory . '/' . $id, $binary);

    // Get the json file with information about the upload session.
    $session = self::uploadSessionInfo($directory);

    // Increase the number of loaded chunks by 1 and add a new element to the chunks subarray.
    $session['chunks_info']['loaded_chunks'] = $session['chunks_info']['loaded_chunks'] + 1;
    $session['chunks_info']['chunks'][] = [
        'chunk_id' => $id
    ];

    // Save the modified session file.
    Storage::put($directory . '/session.json', json_encode($session));

    // If the number of loaded chunks equals the total number of chunks, assemble the final file.
    if ($session['chunks_info']['total_chunks'] === $session['chunks_info']['loaded_chunks']) {
        Storage::put($directory . '/' . $session['file_name'], []);

        foreach ($session['chunks_info']['chunks'] as $key => $value) {
            $chunkPath = storage_path() . '/app/' . $directory . '/' . $value['chunk_id'];

            $file = fopen($chunkPath, 'rb');
            // Read the whole chunk (the original read a fixed 2 MB, which would truncate larger chunks).
            $buff = fread($file, filesize($chunkPath));
            fclose($file);

            $final = fopen(storage_path() . '/app/' . $directory . '/' . $session['file_name'], 'ab');
            $write = fwrite($final, $buff);
            fclose($final);
        }
    }
}
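One property of the code above that could be used here (my observation, offered as a sketch rather than a confirmed design): each stored chunk id is already the md5 of the chunk's bytes. On resume, the client can recompute the hashes of the chunks it has already sent (it still has the file locally) and ask the server whether they still match, without re-uploading any data. A hypothetical server-side check against the session.json layout used above:

/**
 * Compare the client's freshly computed chunk hashes against the stored chunk ids.
 * $clientHashes is assumed to be an array of md5 strings in upload order.
 */
private static function chunksStillValid(array $clientHashes, $directory): bool
{
    $session = self::uploadSessionInfo($directory);

    foreach ($session['chunks_info']['chunks'] as $index => $chunk) {
        // The stored id is 'chunk_' . md5($binary), so strip the prefix before comparing.
        $storedHash = substr($chunk['chunk_id'], strlen('chunk_'));

        if (!isset($clientHashes[$index]) || $clientHashes[$index] !== $storedHash) {
            // The file was modified (or chunks are missing): the upload should restart.
            return false;
        }
    }

    return true;
}

If chunksStillValid() returns false, the server can discard the stored chunks and the client starts the upload over; otherwise only the missing chunks need to be sent.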

PHP Zip Maximum Number of Files

I have a little problem. I have a script that allows people to upload files via a multiple-select file input. The input fields are submitted via an XMLHttpRequest. I tried to select 98 pictures of about 900 MB in total; they were uploaded and the ZIP script finished without any error. But when I download the file, it is only 200 MB and only 20 pictures are in the ZIP. I increased the maximum execution time on the server, but the script seems to run for only 24 seconds. I increased the PHP memory limit to 2 GB, and the server has enough RAM as well. The maximum file size is about 2 GB, as is the maximum upload size.
Here is the script:
$zip = new ZipArchive();
$res = $zip->open(__DIR__."/../files/".$filename, ZIPARCHIVE::CREATE);

if($res){
    for($i = 0; $i < count($_FILES['datei']['name']); $i++){
        move_uploaded_file($_FILES['datei']['tmp_name'][$i], __DIR__.'/../temp/'.$_FILES['datei']['name'][$i]);

        if(file_exists(__DIR__.'/../temp/'.$_FILES['datei']['name'][$i]) && is_readable(__DIR__.'/../temp/'.$_FILES['datei']['name'][$i])){
            $zip->addFile(__DIR__.'/../temp/'.$_FILES['datei']['name'][$i], $_FILES['datei']['name'][$i]);
        }else{
            $status['uploaded_file'] = 500;
        }
    }

    $res_close = $zip->close();
    if($res_close){
        $status['uploaded_file'] = 200;
    }

    for($i = 0; $i < count($_FILES['datei']['name']); $i++){
        unlink(__DIR__.'/../temp/'.$_FILES['datei']['name'][$i]);
    }
}else{
    die($res);
    $status['uploaded_file'] = 500;
}
The script basically moves all the temporary upload files to another temp folder. From that temp folder they are zipped into a file in the files folder. Afterwards, the files in the temp folder are deleted.
Is there anything stupid I am doing wrong? Or is there another limitation I didn't see?
Thanks for the help
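One limitation worth ruling out (an assumption on my part, not something confirmed in the post): PHP's max_file_uploads ini setting defaults to 20, and any files beyond that limit are silently dropped from $_FILES before the script even runs, which would match the 20 pictures that end up in the ZIP. A quick check:

// Print the upload-related limits and how many files actually arrived.
var_dump(ini_get('max_file_uploads'));   // defaults to 20
var_dump(ini_get('post_max_size'));
var_dump(ini_get('upload_max_filesize'));
var_dump(ini_get('max_input_time'));
var_dump(count($_FILES['datei']['name']));

If count($_FILES['datei']['name']) comes back as 20, raising max_file_uploads (and keeping post_max_size large enough for the whole request) would be the place to start.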
