To quote some famous words:
“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”
While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)
Suppose you have a big collection of JPG files and a few odd SWF files. By "big" I mean "a couple of thousand". Every JPG file is around 200 KB, and the SWFs can be up to a few MB in size. Every day a few new JPG files are added. The total size of all the files is thus around 1 GB, and it is slowly but steadily increasing. Files are very rarely changed or deleted.
The users can view each of the files individually on the web page. However, there is also a wish to let them download a whole bunch of them at once. The files have some metadata attached (date, category, etc.) that the user can filter the collection by.
The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.
Since the number of possible criteria combinations is large, I cannot pre-generate all the possible ZIP files and must create them on the fly. Another problem is that the download can be quite large, and for users with slow connections it's quite likely to take an hour or more. Support for "resume" is therefore a must-have.
On the bright side, however, the ZIP doesn't need to compress anything; the files are mostly JPEGs anyway. Thus the whole process shouldn't be any more CPU-intensive than a simple file download.
The problems I have identified are:
PHP has an execution timeout for scripts. While it can be changed by the script itself, will removing it completely cause any problems?
With the resume option, there is the possibility of the filter results changing between HTTP requests. This might be mitigated by sorting the results chronologically, as the collection only ever grows. The request URL would then also include the date when it was originally created, and the script would not consider files newer than that. Will this be enough?
Won't passing large amounts of file data through PHP be a performance hit in itself?
How would you implement this? Is PHP up to the task at all?
Added:
By now two people have suggested storing the requested ZIP files in a temporary folder and serving them from there as regular files. While this is indeed the obvious solution, several practical considerations make it infeasible.
The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreds of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also, there are many possible filter combinations, and many of them are likely to be selected by the users.
As a result, the ZIP files will be pretty slow to generate (due to the sheer volume of data and disk speed) and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.
This may be what you need:
http://pablotron.org/software/zipstream-php/
This library lets you build a dynamic, streaming ZIP file without ever writing it to disk.
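For illustration, here is a minimal usage sketch of such a streaming library. The class and method names (ZipStream, addFileFromPath, finish) follow the newer maennchen/ZipStream-PHP fork and may differ from the pablotron release linked above, so treat this as an assumption-laden illustration rather than a drop-in snippet:

require 'vendor/autoload.php';

// Streams a ZIP named photos.zip directly to the client as it is built.
$zip = new ZipStream\ZipStream('photos.zip');

// $paths is assumed to be the list of files matching the user's filter.
foreach ($paths as $path) {
    // Each file is read and sent on the fly; nothing is written to disk.
    $zip->addFileFromPath(basename($path), $path);
}

$zip->finish();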
Use e.g. the PhpConcept Library Zip library.
Resuming must be supported by your web server, except in the case where you don't make the ZIP files accessible directly. If a PHP script acts as a mediator, pay attention to sending the right headers to support resuming.
The script creating the files shouldn't ever time out; just make sure the users can't select thousands of files at once. Also keep something in place to remove old ZIP files, and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
You're going to have to store the generated ZIP file if you want them to be able to resume downloads.
Basically, you generate the ZIP file and chuck it in a /tmp directory with a repeatable filename (a hash of the search filters, maybe). Then you send the correct headers to the user and echo the file's contents.
To support resuming you need to check the $_SERVER['HTTP_RANGE'] value; its format is detailed here, and once you've parsed that you'll need to run something like this.
$size = filesize($zip_file);
if (isset($_SERVER['HTTP_RANGE'])) {
    // Parse "bytes=start-end" out of the Range header.
    list($start, $end) = explode('-', substr($_SERVER['HTTP_RANGE'], strlen('bytes=')));
    if ($end === '') {
        $end = $size - 1; // open-ended range ("bytes=500-"): serve to the end of the file
    }
    $new_length = $end - $start + 1;
    header("HTTP/1.1 206 Partial Content");
    header("Content-Length: $new_length");
    header("Content-Range: bytes $start-$end/$size");
    echo file_get_contents($zip_file, false, null, $start, $new_length);
} else {
    header("Content-Length: $size");
    echo file_get_contents($zip_file);
}
This is very sketchy code; you'll probably need to play around with the headers and the handling of the HTTP_RANGE value a bit. You can use fopen and fread rather than file_get_contents if you wish, and just fseek to the right place; a rough sketch of that variant follows.
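For completeness, here is a sketch of that fopen/fseek variant. It assumes $zip_file, $start and $end hold the same values parsed above, and streams the range in small chunks instead of pulling it all into memory:

$length = $end - $start + 1;

header("HTTP/1.1 206 Partial Content");
header("Content-Type: application/zip");
header("Content-Length: $length");
header("Content-Range: bytes $start-$end/" . filesize($zip_file));

// Stream the requested range in 8 KB chunks so memory use stays constant.
$fp = fopen($zip_file, 'rb');
fseek($fp, $start);
$remaining = $length;
while ($remaining > 0 && !feof($fp) && !connection_aborted()) {
    $chunk = fread($fp, min(8192, $remaining));
    echo $chunk;
    flush();
    $remaining -= strlen($chunk);
}
fclose($fp);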
Now to your questions
PHP has an execution timeout for scripts. While it can be changed by the script itself, will removing it completely cause any problems?
You can remove it if you want to. However, if something goes pear-shaped and your code gets stuck in an infinite loop, that can lead to interesting problems, should that infinite loop be logging an error somewhere that you don't notice until a rather grumpy sysadmin wonders why their server ran out of hard disk space ;)
With the resume option, there is the possibility of the filter results changing between HTTP requests. This might be mitigated by sorting the results chronologically, as the collection only ever grows. The request URL would then also include the date when it was originally created, and the script would not consider files newer than that. Will this be enough?
Caching the file to the hard disk means you won't have this problem.
Won't passing large amounts of file data through PHP be a performance hit in itself?
Yes, it won't be as fast as a regular download served directly by the web server, but it shouldn't be too slow.
I have a download page, and I made a ZIP class that is very similar to your ideas.
My downloads are very big files that can't be zipped properly with the ZIP classes out there,
and I had similar ideas to yours.
Giving up compression is a very good approach: not only do you need fewer CPU resources, you also save memory, because you don't have to touch the input files and can just pass them through. You can also calculate everything, like the ZIP headers and the final file size, very easily, and you can jump to any position and generate the archive from that point to implement resume.
I go even further: I generate one checksum from all the input files' CRCs and use it as an ETag for the generated file to support caching, and as part of the filename.
If you have already downloaded the generated ZIP file, the browser gets it from the local cache instead of from the server.
You can also limit the download rate (for example to 300 KB/s).
You can add ZIP comments.
You can choose which files get added and which don't (for example Thumbs.db).
But there's one problem with the ZIP format that you can't overcome completely:
the generation of the CRC values.
Even if you use hash_file() to avoid the memory problem, or hash_update() to generate the CRC incrementally, it will use too much CPU.
Not much for one person, but it's not recommended for professional use.
I solved this with an extra table of CRC values that I generate with a separate script.
I pass these CRC values to the ZIP class as a parameter.
With that, the class is ultra fast,
like a regular download script, as you mentioned.
My ZIP class is a work in progress; you can have a look at it here: http://www.ranma.tv/zip-class.txt
I hope it helps someone :)
But I will discontinue this approach; I will rewrite my class as a tar class.
With tar I don't need to generate CRC values from the files; tar only needs a checksum for each header, that's all.
And I don't need an extra MySQL table any more.
I think it makes the class easier to use if you don't have to create an extra CRC table for it.
It's not that hard, because tar's file structure is simpler than the ZIP structure.
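For reference, here is a rough sketch of what building a ustar header looks like in PHP. This is my own illustration of the format, not code from the class linked above, and it leaves the uname/gname/prefix fields empty:

// Build a 512-byte ustar header for one regular file. Unlike ZIP, no CRC of
// the file contents is needed; only a simple checksum of the header itself.
function tar_header(string $name, int $size, int $mtime): string
{
    $header  = str_pad($name, 100, "\0");       // file name (<= 100 bytes here)
    $header .= sprintf("%07o\0", 0644);         // mode
    $header .= sprintf("%07o\0", 0);            // uid
    $header .= sprintf("%07o\0", 0);            // gid
    $header .= sprintf("%011o\0", $size);       // size, in octal
    $header .= sprintf("%011o\0", $mtime);      // mtime, in octal
    $header .= '        ';                      // checksum placeholder: 8 spaces
    $header .= '0';                             // typeflag: regular file
    $header .= str_repeat("\0", 100);           // linkname (unused)
    $header .= "ustar\0" . '00';                // magic + version
    $header  = str_pad($header, 512, "\0");     // uname/gname/dev/prefix left empty

    // Checksum = byte sum of the header with the checksum field read as spaces.
    $sum = array_sum(unpack('C*', $header));
    return substr_replace($header, sprintf("%06o\0 ", $sum), 148, 8);
}

// Usage sketch: header, file data padded to a 512-byte boundary, and two
// empty 512-byte blocks at the very end of the archive.
// echo tar_header($name, filesize($path), filemtime($path));
// readfile($path);
// echo str_repeat("\0", (512 - filesize($path) % 512) % 512);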
PHP has an execution timeout for scripts. While it can be changed by the script itself, will removing it completely cause any problems?
If your script is safe and closes on user abort, then you can remove it completely.
But it would be safer if you just renewed the timeout for every file that you pass through :)
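Something along these lines (just a sketch; $files and the actual output logic stand in for whatever the class does per file):

foreach ($files as $path) {
    set_time_limit(30);      // renew the limit: 30 more seconds for each file
    readfile($path);         // or the pass-through / ZIP-header logic
    flush();
    if (connection_aborted()) {
        exit;                // the client disconnected; stop wasting resources
    }
}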
With the resume option, there is the possibility of the filter results changing between HTTP requests. This might be mitigated by sorting the results chronologically, as the collection only ever grows. The request URL would then also include the date when it was originally created, and the script would not consider files newer than that. Will this be enough?
Yes, that would work.
I generated a checksum from the input files' CRCs and used it as an ETag and as part of the ZIP filename.
If something changes, the user can't resume the previously generated ZIP, because the ETag and the filename change together with the content.
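A minimal sketch of that ETag idea, assuming $crcs is an array of the precomputed per-file CRC32 values:

// One checksum over all per-file CRCs; it changes whenever the content changes.
$etag = '"' . md5(implode(',', $crcs)) . '"';
header("ETag: $etag");

// If the browser already has this exact archive, let it use its cache.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
    trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}
// ...otherwise generate and stream the archive as usual.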
Won't passing large amounts of file data through PHP be a performance hit in itself?
No, if you only pass the data through, it will not use much more than a regular download. Maybe 0.01% more, I don't know; it's not much :) I assume that's because PHP doesn't do much with the data :)
You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided into chunks, instead of loading the entire content in PHP and then sending the ZIP file.
Both libraries are nice and useful pieces of code. A few details:
ZipStream "works" only with memory, but cannot be easily ported to PHP 4 if necessary (uses hash_file())
PHPZip writes temporary files on disk (consumes as much disk space as the biggest file to add in the zip), but can be easily adapted for PHP 4 if necessary.
Related
What we want to do is add a kind of MP3 preroll to another MP3 file in real time. That means we have two physical MP3 files on the server which are not merged into one yet, because ffmpeg & Co. take too much time. It has to happen in real time so that no time is lost when someone starts the (web) player. The practical use case is adding prerolls to podcast files. What we already did (described below) works, except for displaying the correct file duration in audio players.
One of my co-workers did this, so I'll try to describe it as well as possible.
What my co-worker already did is tell the client, via the headers, that two files are coming in a row, by reading both files and echoing them via PHP. HTTP/1.1 206 Partial Content is used for delivering the "merged" content.
The problem is that there are still two ID3 tags, one from each file, and most audio players only read the first one, which results in a wrong duration being displayed. The only case where it works 100% is VLC, after downloading the whole thing. No web player, nor iTunes etc., can handle the "merged" file's duration.
Any idea how to create a "virtual ID3 Tag" in real time and how to remove the existing ones without touching the original files?
There are a lot of inaccurate conclusions you've come to, so let me start by correcting those, which may help you solve the problem.
because ffmpeg & Co. take too much time
FFmpeg can merge these audio streams faster than you can stream to clients for sure. If you're using -codec copy (which you should be in this case), it will handle all the demuxing/muxing for you. And, keep in mind that you can stream directly out of FFmpeg. No need for an intermediary file.
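For illustration, a rough sketch of streaming the merged audio straight out of FFmpeg through PHP using the concat demuxer; the file paths are placeholders and it assumes both files are MP3s with compatible parameters:

// Write a temporary list file for FFmpeg's concat demuxer.
$list = tempnam(sys_get_temp_dir(), 'concat');
file_put_contents($list,
    "file '/path/to/preroll.mp3'\n" .
    "file '/path/to/episode.mp3'\n");

header('Content-Type: audio/mpeg');

// -codec copy remuxes without re-encoding, so this runs much faster than
// real time; pipe:1 sends the result straight to stdout and on to the client.
passthru('ffmpeg -f concat -safe 0 -i ' . escapeshellarg($list)
       . ' -codec copy -f mp3 pipe:1');

unlink($list);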
The practical case is to add prerolls to podcast files.
The FFmpeg route is what you want.
What my coworker already did is telling the header that two files are coming in a row by reading both files and echoing them via PHP. HTTP/1.1 206 Partial Content is used for delivering the "merged" content.
That's a bit of a wonky way to do this. You could instead just merge the data and send it directly in a single response.
The problem is that there are still two ID3 tags, one from each file, and most audio players only read the first one, which results in a wrong duration being displayed.
No, the usual ID3 tags don't indicate duration. (There is an extension which does, but this is rarely used.) There is nothing in the bare MP3 stream that indicates duration either. Clients estimate this based on file size and bitrate. The bitrate can change mid-stream, so they usually estimate based on the bitrate of the first couple frames.
Undoubtedly, the problem in your case is incorrect length headers due to the way you're handling this merging, and/or a mismatch of bitrate which causes the length estimate from the player to be wrong.
Any idea how to create a "virtual ID3 Tag" in real time and how to remove the existing ones without touching the original files?
I would absolutely use FFmpeg for this work. If anything, because not all podcasts use MP3. There are plenty of AAC in MP4 podcasts, and a handful of Opus in WebM as well.
Best practices to export CSV in PHP: output buffer vs temporary file
Scenario
I execute a SELECT on a database that may return any number of rows, few or many (one million+). Those rows need to go into a .csv file, with the first row being the header.
Doubt
I know two ways of exporting CSV files with PHP: writing to the output buffer via php://output, or creating a temporary file, serving it to the user, then deleting it.
Which way is better, knowing that the file may be small or very big? Consider the PHP memory limit (in php.ini), the request timeout, etc.
Using a temporary file is the only good option in case you have large files.
You can redirect a second request (if the file exists) directly to your file and let the web server serve it without executing PHP.
If the client disconnected while downloading a file through the API, in most cases they will simply start downloading again.
On top of that, you will have access logs on your web server to check who accessed the file and how many times.
It depends on the situation.
Use an output buffer when you know the file is not ridiculously large and when the download doesn't occur too often.
When you have something large that will be downloaded many times (simultaneously), writing it to a file might be better, to lighten the load on your database and site.
I'd think the answer is pretty obvious: write directly to php://output. It's the same as echo; the output will be sent to the client more or less directly. It may or may not get buffered for a bit, but unless you have explicit output buffering activated or your web server has a ridiculously large buffer, it'll send it right through. "Sending a file" (presumably via readfile) would pass the data through the same output buffer, but would be much more complicated and error-prone.
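A minimal sketch of that approach, assuming a PDO connection in $pdo and a hypothetical items table; rows are fetched and written one at a time, so memory use stays flat regardless of the result size:

header('Content-Type: text/csv; charset=utf-8');
header('Content-Disposition: attachment; filename="export.csv"');

$out = fopen('php://output', 'w');

// Header row first.
fputcsv($out, array('id', 'name', 'created_at'));

// Stream each row straight to the client as it is fetched.
$stmt = $pdo->query('SELECT id, name, created_at FROM items');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    fputcsv($out, $row);
}

fclose($out);

With MySQL you may also want to disable buffered queries (PDO::MYSQL_ATTR_USE_BUFFERED_QUERY set to false) so the driver doesn't pull the whole result set into memory before the loop starts.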
I'm making a really simple "backend" (PHP 5) for two Flash/AIR applications. One of them will upload a photo, the backend will save it to a folder, and the second app will poll the backend for new photos and show them.
I don't have any access to a database, so the backend has to be pure PHP 5 and nothing more. That's why I chose to save the images to a folder (with a timestamp in their names) and use readdir() to get them back.
This all works like a charm. Nevertheless, I would really like to make sure the backend only returns photos that are completely uploaded, preventing the second app from trying to load an unfinished image. Are there any methods/tricks that I can use to validate a file?
You could check the filesize a couple hundred milliseconds apart and see if it changes:
$first = filesize($file);
// wait 100 ms
usleep(100000);
// clear PHP's stat cache, otherwise the second call may return the cached size
clearstatcache();
$second = filesize($file);
if ($first == $second) {
    // file is no longer being actively uploaded
}
The usual trick for atomic filesystem operations is to write into a temporary file that is not matched by the reader (e.g. XXX.jpg.tmp) and, once it's completely uploaded, rename it to its target name. Renames on the same volume are atomic, so there is no point in time at which the file is either incomplete or unavailable.
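A small sketch of that trick on the upload side; the target directory and field name are placeholders:

// Save under a name the poller ignores, then atomically rename into place.
$final = '/var/www/photos/' . time() . '-' . basename($_FILES['photo']['name']);
$tmp   = $final . '.tmp';

move_uploaded_file($_FILES['photo']['tmp_name'], $tmp);
rename($tmp, $final);   // only now does the polling app see the file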
A really easy and common way to do so would be to create a trigger file based on the file's name, so that you get something like
123.jpg
123.rdy
or
123.jpg
123.jpg.rdy
You create that file (just an empty stub) as soon as the upload is complete. The application that grabs files to load only cares about files that have a trigger file, and then processes those. Alternatively, you could save the uploaded file as, e.g., 123.bsy or 123.jpg.bsy while it is still being uploaded and then rename it to the final name 123.jpg after the upload is done. Since renames within the same directory are usually really cheap operations in terms of processing time, the chances of running into a race condition should be pretty low. (This might or might not depend on the OS used, though ...)
If you need to keep the files in that place, you could, of course, use a database where you add a record for each file, as the upload is complete. The other app could then just provide files with a matching database record.
After writing this all down I figured it out myself. What I did was add the exact number of bytes to the filename as well, and validate that while outputting the list of images. The .tmp/.bsy solution is nice too, but I read it a bit too late :)
The upside of my solution is that no renaming is required after the upload is done. Thanks everybody for your fast answers!
Similar to my last question, I'd like to have a PHP function that can take a local path and tell me (a) the total file size of the HTML, CSS, JS and images, and (b) the total load time for the page. Like YSlow, I think, but as a PHP function.
Any thoughts? I looked around and was wondering whether I can use cURL for this, even though I need to check paths that are on my own server? Thanks!
Update:
After reading the comments, I realize I'm off base. Instead, I'm wondering whether there is a way to programmatically get a YSlow score (or a similar performance score) for a page. I assume it would need to hit a third-party service that would act as the client. I'm basically trying to loop through a group of pages and get some sort of performance metric. Thanks!
For the filesize.
Create a loop that reads all the files in a specific directory with dir().
Then call filesize() on each file, as in the sketch below.
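A rough sketch, using SPL iterators instead of dir() so that it also recurses into subdirectories; the path and extension list are just examples:

$total = 0;
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/www/mysite', FilesystemIterator::SKIP_DOTS)
);
foreach ($files as $file) {
    // Only count the asset types the question mentions.
    if (preg_match('/\.(html?|css|js|png|jpe?g|gif)$/i', $file->getFilename())) {
        $total += $file->getSize();
    }
}
echo round($total / 1024) . " KB total\n";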
Loadtime
Load time depends on the connection speed and the file size. And I see that you specify that you are reading the files locally. You can measure how much time it takes you to read those files, but that will not be the load time of the page for an external user.
I want to create multiple thumbnails using the GD library in PHP, and I already have a script to do this. The question is what is better for me: is it better to create the thumbnail on the fly, or to create a physical file on my server each time I want a thumb? And why?
Please consider processing time, storage capacity and the other disadvantages of both.
When you create the thumbnail depends on a couple of factors (which I'll get into), but you should never discard the output of something like this (unless you'll never use it again), as it's a really expensive operation.
Anyway, your two main choices for "when to generate the thumbnail" are:
When it's first requested. This is common, and it means you don't generate thumbnails that are never used, but it does mean that if you have a page full of first-time thumbnails, the server might become overwhelmed with PHP processes generating them.
I had a similar issue with Sorl+Django where I was generating 100+ thumbnails per request for the first few requests after uploading and it basically made the entire server hang for 20 minutes. Not good.
Generate all required thumbnails when you upload. Because uploading takes a long time anyway, you spread out the processing quite a lot. You can also pull it out-of-process (i.e. use another script to process uploads, perhaps not even in PHP).
The obvious downside is you're using up disk space that you otherwise might not need to use up... But unless you're talking about hundreds of thousands of thumbnails, a small percentage of unused ones probably won't break the bank.
Of course, if disk space is an issue, there might be an argument for pushing the thumbnail up to a CDN at the same time as you process it.
One note: when you save the thumbnails, it's fairly common that you'll want to resize them at some point down the line, or perhaps want a couple of different sizes. I find it really useful to make the filenames very specific, so if the original image is image.jpg, the 200x200 version is image-200x200.jpg.
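A minimal sketch of the generate-on-first-request-and-keep-it approach with GD, using that naming convention; the function name is just an example, and aspect-ratio handling is omitted for brevity:

function thumbnail(string $src, int $w, int $h): string
{
    $info  = pathinfo($src);
    $thumb = $info['dirname'] . '/' . $info['filename'] . "-{$w}x{$h}.jpg";

    // Only do the expensive GD work once; afterwards the file is served as-is.
    if (!file_exists($thumb)) {
        list($origW, $origH) = getimagesize($src);
        $image  = imagecreatefromjpeg($src);
        $scaled = imagecreatetruecolor($w, $h);
        imagecopyresampled($scaled, $image, 0, 0, 0, 0, $w, $h, $origW, $origH);
        imagejpeg($scaled, $thumb, 85);
        imagedestroy($image);
        imagedestroy($scaled);
    }
    return $thumb;
}

// Usage: $src = thumbnail('uploads/image.jpg', 200, 200);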
Neither/both - don't generate the thumbnails till you need them - but keep the files you generate.
That way you'll minimise the amount of work needed and have a self-repairing system
C.
GD is really resource-heavy, so you should look at whether you can use ImageMagick instead (which also has clearer syntax).
You will definitely be better off caching the created thumbnails after the first run (regardless of whether you use GD or ImageMagick) and serving them from the cache. If you are worried about storage, clear old files out of the cache now and then.
Always cache (i.e. write out to disk) the results of GD operations. They are too expensive, in both processor time and memory, to be done on the fly every time. This becomes increasingly true the more visitors/hits you have.