We use S3 to store media uploaded via our application, such as images, documents, etc. We work in the property software industry, and a common format for exchanging the data stored in our system with property portals is the Rightmove BLM data feed specification. This is essentially a zip file containing a delimited text file and any associated media, which is sent via FTP to each portal.

A bottleneck in the process is downloading the media from S3 for zipping. A single account on our system could have in the region of 1000 images/documents to be downloaded and zipped in preparation for transfer, and each file has to be named in a particular format for the particular portal (unique number, sequence numbers, etc.). However, downloading 1000 images/documents from S3 to an EC2 server in the same region via the PHP SDK takes some time (60+ seconds), and doing this for multiple accounts at the same time puts considerable load on the server.
Is there a better/faster way to download files from S3 so they can be prepped and zipped on the EC2 instance?
Thanks.
One option would be to aggregate the zip as the files are added. Meaning, instead of zipping the files all at once, use a Lambda function to add them to a zip file as they're added to or updated on the S3 bucket. Then, the zip would be available more-or-less on demand.
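If the files do still have to be pulled down to the EC2 instance before zipping, the download itself can at least be parallelised instead of fetching one object at a time through the PHP SDK. A minimal sketch using the SDK's CommandPool; the bucket name, keys, local path and concurrency figure are all placeholders:

```php
<?php
// Sketch only: bucket, keys, target directory and concurrency are placeholders.
// Requires aws/aws-sdk-php (composer require aws/aws-sdk-php).
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\CommandPool;

$s3 = new S3Client([
    'region'  => 'eu-west-1',   // same region as the EC2 instance
    'version' => 'latest',
]);

$bucket = 'my-media-bucket';                       // placeholder
$keys   = ['account-123/IMG_0001.jpg', /* ... */]; // the ~1000 objects for one account

// One GetObject command per key, saved straight into the local working directory
// where the files will be renamed to the portal's naming scheme and zipped.
$commands = [];
foreach ($keys as $key) {
    $commands[] = $s3->getCommand('GetObject', [
        'Bucket' => $bucket,
        'Key'    => $key,
        'SaveAs' => '/var/tmp/blm-work/' . basename($key),
    ]);
}

// Run the downloads with bounded concurrency instead of one at a time.
CommandPool::batch($s3, $commands, ['concurrency' => 25]);
```

Whether 25 concurrent requests is the right figure depends on the instance size; concurrency trades CPU and network for wall-clock time rather than removing the work.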
Related
We want to store a few TBs of video files on Google Cloud Storage, but our webservers, which will serve the videos to visitors, are dedicated servers in our own datacenter.
To reduce the cost of outbound Google Cloud traffic we want to cache as many video files as possible locally on the webservers. A cronjob will delete the oldest files (by atime) from the cache when needed.
When a visitor requests a video file that is not in the local cache, we start downloading it from GCS using the stream wrapper feature of google/cloud and a combination of fopen, fread and fwrite to the local cache file. We also use ignore_user_abort to ensure the script finishes even if the user aborts playback. With this solution it is possible to send the video to the client and save it locally at the same time. The next request for this video will then use the local cache file.
This works fine as long as the player does not support seeking, but we also want to provide seeking via HTTP_RANGE headers, which means clients will request only partial content. Seeking to the right position of the video file in GCS works, and seeking to the right position of the locally cached video file also works. What I need is a way to build the locally cached video file step by step, depending on the partial content requested: as long as a requested range is not yet available in the local file, we download that part from GCS and write it to the local file. We have to mark the successfully cached parts and keep downloading parts from GCS until the local file is complete. But I don't know how to share the information about which parts are locally available between all the individual HTTP requests. I feel that all these requests for the same video file, with many different ranges from different clients, sound like trouble... Any thoughts on this?
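One way to share which parts are locally available between otherwise independent PHP requests is a small sidecar file per cached video holding a merged list of byte ranges, protected with flock(). This is only a sketch; the sidecar naming, JSON layout and merge logic are assumptions, not anything from the question:

```php
<?php
// Sketch: a ".ranges.json" sidecar per cached video records which byte ranges
// have already been written locally. flock() serialises access so concurrent
// range requests see a consistent view. Paths and file layout are placeholders.

function addCachedRange(string $cachePath, int $start, int $end): void
{
    $fp = fopen($cachePath . '.ranges.json', 'c+');
    flock($fp, LOCK_EX);                       // exclusive lock while updating

    $raw    = stream_get_contents($fp);
    $ranges = $raw !== '' ? json_decode($raw, true) : [];
    $ranges[] = [$start, $end];

    // Merge overlapping/adjacent ranges so the map stays small.
    usort($ranges, fn($a, $b) => $a[0] <=> $b[0]);
    $merged = [];
    foreach ($ranges as $r) {
        if ($merged && $r[0] <= end($merged)[1] + 1) {
            $merged[count($merged) - 1][1] = max(end($merged)[1], $r[1]);
        } else {
            $merged[] = $r;
        }
    }

    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, json_encode($merged));
    flock($fp, LOCK_UN);
    fclose($fp);
}

function isRangeCached(string $cachePath, int $start, int $end): bool
{
    $mapPath = $cachePath . '.ranges.json';
    if (!is_file($mapPath)) {
        return false;
    }
    $fp = fopen($mapPath, 'r');
    flock($fp, LOCK_SH);                       // shared lock for readers
    $ranges = json_decode(stream_get_contents($fp), true) ?: [];
    flock($fp, LOCK_UN);
    fclose($fp);

    foreach ($ranges as [$s, $e]) {
        if ($s <= $start && $end <= $e) {
            return true;
        }
    }
    return false;
}
```

Each request would call isRangeCached() before deciding whether to serve from the local file or fetch the range from GCS, and addCachedRange() after writing a downloaded chunk; once the merged list covers the whole file, the cache entry is complete.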
I have a quiz site which creates images from a set of source images; the result images are stored in S3 and I'm not concerned about those. My question is about the source images: is S3 or EFS better for storing them for this purpose? I am using PHP to create the result images.
Here's a general rule for you: Always use Amazon S3 unless you have a reason to do otherwise.
Why?
It has unlimited storage
The data is replicated for resilience
It is accessible from anywhere (given the right permissions)
It has various cost options
Can be accessed by AWS Lambda functions
The alternative is a local disk (EBS) or a shared file system (EFS). They are more expensive, can only be accessed from EC2 and take some amount of management. However, they have the benefit that they act as a directly-attached storage device, so your code can reference it directly without having to upload/download.
So, if your code needs the files locally, then EFS would be a better choice. But if your code can handle S3 (download from it, use the files, upload the results), then S3 is the better option.
Given that your source images will (presumably) be at a higher resolution than those you are creating, and that once processed they will not need to be accessed regularly (again, presumably), I would suggest that the lower cost of S3 and the archiving options available there mean it would be best for you. There's a more in-depth answer here:
AWS EFS vs EBS vs S3 (differences & when to use?)
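As a rough illustration of the S3 workflow described above (download the source, generate the result, upload it back), assuming the AWS SDK for PHP and GD; the bucket names, keys and output size are placeholders:

```php
<?php
// Sketch of the download -> process -> upload cycle with S3 as the only store.
// Bucket names, keys and sizes are placeholders; requires aws/aws-sdk-php and GD.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'eu-west-1', 'version' => 'latest']);

// 1. Fetch the source image to a temporary local file.
$tmp = tempnam(sys_get_temp_dir(), 'src_');
$s3->getObject([
    'Bucket' => 'quiz-source-images',      // placeholder
    'Key'    => 'sources/cat.jpg',         // placeholder
    'SaveAs' => $tmp,
]);

// 2. Build the result image locally (a simple resize as a stand-in).
$src = imagecreatefromjpeg($tmp);
$dst = imagescale($src, 400);              // 400px-wide result

$out = tempnam(sys_get_temp_dir(), 'res_');
imagejpeg($dst, $out, 85);
imagedestroy($src);
imagedestroy($dst);

// 3. Upload the result and clean up the local copies.
$s3->putObject([
    'Bucket'     => 'quiz-result-images',  // placeholder
    'Key'        => 'results/cat-400.jpg', // placeholder
    'SourceFile' => $out,
]);
unlink($tmp);
unlink($out);
```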
I have a Laravel PHP app where a user is going to upload an image. This image is going to be converted into a number of different sizes as required around the application, and each version is then going to be uploaded to AWS S3.
When the user uploads the image, PHP places it in /tmp, where it only survives until the request has completed unless it has been moved or renamed. I am planning on pushing the job of converting and uploading the versions to a queue. What is the best way to ensure that the image stays around long enough to be converted and then uploaded to S3?
Secondly, where should I save the different versions so that I can access them to upload to S3 and then remove them from the server (preferably automatically)?
I would create a new directory and work in that. The /tmp folder is flushed every now and then, depending on your system.
As for the different sizes, I would create separate buckets for each size, which you can access with whatever constant you use to store the image (e.g. email, user ID, etc.).
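A minimal sketch of that flow in Laravel terms: the controller moves the upload out of /tmp onto a persistent disk before the request ends, and a queued job reads it from there, builds the sizes, pushes them to S3 and cleans up. The job class name, directory names and size list are placeholders, and a single bucket with size prefixes stands in for the per-size buckets suggested above:

```php
<?php
// Sketch only: class, directory and size names are placeholders.

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Storage;

class ImageUploadController extends Controller
{
    public function store(Request $request)
    {
        // Move the upload out of /tmp onto storage/app/uploads *before* the
        // request finishes, so the /tmp copy being removed no longer matters.
        $path = $request->file('image')->store('uploads');

        ProcessImageVersions::dispatch($path);   // hypothetical queued job

        return back()->with('status', 'Image queued for processing');
    }
}

class ProcessImageVersions implements \Illuminate\Contracts\Queue\ShouldQueue
{
    use \Illuminate\Foundation\Bus\Dispatchable, \Illuminate\Queue\InteractsWithQueue;

    public function __construct(public string $path) {}

    public function handle(): void
    {
        $local = Storage::path($this->path);     // absolute path of the working copy

        foreach ([1024, 512, 128] as $width) {   // placeholder size list
            $src = imagecreatefromjpeg($local);
            $dst = imagescale($src, $width);

            $tmp = tempnam(sys_get_temp_dir(), 'ver_');
            imagejpeg($dst, $tmp, 85);
            imagedestroy($src);
            imagedestroy($dst);

            // Push this version to S3, then remove the local temp file immediately.
            Storage::disk('s3')->put("images/{$width}/" . basename($this->path), fopen($tmp, 'r'));
            unlink($tmp);
        }

        Storage::delete($this->path);            // drop the working copy once done
    }
}
```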
I have a bucket on Amazon S3 which contains hundreds of objects.
I have a web page that lists out all these objects and has a download object link in html.
This all works as expected and I can download each object individually.
How would it be possible to provide a checkbox next to each link, which allowed a group of objects to be selected and then only those objects downloaded?
So to be clear, if I chose items 1, 2 and 7 and clicked a download link, only those objects would be downloaded. This could be as a zip file or one at a time, although I have no idea how this would work.
I am capable of coding this up, but I am struggling to think through HOW it would work - so process descriptions are welcome. I could consider Python or Ruby, although the web app is PHP.
I'm afraid this is a hard problem to solve.
S3 does not allow any 'in place' manipulation of files, so you cannot zip them up into a single download. In the browser, you are stuck with downloading one URL at a time. Of course, there's nothing stopping the user queuing up downloads manually using a download manager, but there is nothing you can do to help with this.
So you are left with a server-side solution. You'll need to download the files from S3 to a server and zip them up before delivering the zip to the client. Unfortunately, depending on the number and size of files, this will probably take some time, so you'll need a notification system to let the user know when their file is ready.
Also, unless your server is running on EC2, you might be paying twice for bandwidth charges. S3 to your server and then your server to the client.
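A rough sketch of that server-side path: the page posts the keys ticked in the checkboxes, the server pulls each object from S3, zips them and returns the archive. The bucket name and paths are placeholders; it assumes the AWS SDK for PHP and the ZipArchive extension:

```php
<?php
// Sketch: the browser posts the selected object keys, the server pulls each one
// from S3 and returns a zip. Bucket name and paths are placeholders.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3     = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$bucket = 'my-object-bucket';                       // placeholder
$keys   = $_POST['keys'] ?? [];                     // keys ticked in the checkbox list

$zipPath = tempnam(sys_get_temp_dir(), 'dl_') . '.zip';
$zip     = new ZipArchive();
$zip->open($zipPath, ZipArchive::CREATE);

foreach ($keys as $key) {
    // Pull the object body and add it under its original name.
    $result = $s3->getObject(['Bucket' => $bucket, 'Key' => $key]);
    $zip->addFromString(basename($key), (string) $result['Body']);
}
$zip->close();

// Hand the finished zip to the browser, then remove the temp file.
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="selected-objects.zip"');
header('Content-Length: ' . filesize($zipPath));
readfile($zipPath);
unlink($zipPath);
```

For larger selections this still takes the time described above, so in practice it would sit behind the notification flow rather than being an inline response.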
My website allows users to upload photographs which I store on Amazon's S3. I store the original upload as well as an optimized image and a thumbnail. I want to allow users to be able to export all of their original versions when their subscription expires. So I am thinking the following problems arise
Could be a large volume of data (possibly around 10GB)
How to manage the download process - e.g. if it gets interrupted, knowing where to resume from, and how to verify successful download of the files
Should this be done with individual files, or should the files be zipped and downloaded as one file or a series of smaller zip files?
Are there any tools out there that I can use for this? I have seen Fzip, which is an ActionScript library for handling zip files. I have an EC2 instance running that handles file uploads, so I could use this for downloads also - e.g. copy the files from S3 to EC2, zip them, then send them to the user via a Flash downloader, using Fzip to uncompress the zip folder to the user's hard drive.
Has anyone come across a similar service / solution?
All input appreciated, thanks.
I have not dealt with this problem directly but my initial thoughts are:
Flash or possibly jQuery could be leveraged for a homegrown solution, having the client send back information on what it has received and storing that information in a database log. You might also consider using BitTorrent as a mediator: your users could download a free torrent client and you could investigate a server-side torrent tracker (maybe RivetTracker or PHPBTTracker). I'm not sure how detailed these get, but at the very least, since you are assured you are dealing with a single user, once they become a seeder you can wipe the old file and begin on the next.
Break files larger than 2GB into 2GB chunks to accommodate users with FAT32 drives, which can't handle files larger than ~4GB. Break down to 1GB chunks if space on the server is limited, keeping a record in a database of what has been zipped from S3 (a rough sketch of this chunking follows below).
Fzip is cool but I think it's more for client-side archiving. PHP has ZIP and RAR libraries (http://php.net/manual/en/book.zip.php) you can use to round up files server-side. I think any solution you find will require you to manage security on your own by keeping records in a database of who has got what, plus download keys. Not doing so may lead to people leeching your resources as a file delivery system.
Good luck!
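A rough sketch of that chunking on the PHP side, grouping the user's originals into zip parts that stay under a size cap. The bucket, prefix, output paths and the part-size figure are placeholders; it assumes the AWS SDK for PHP and the ZipArchive extension:

```php
<?php
// Sketch: build the user's export as a series of zip parts, starting a new part
// whenever the running total would pass ~2 GB. Bucket, prefix and paths are placeholders.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

const PART_LIMIT = 2 * 1024 * 1024 * 1024;          // ~2 GB per part

$s3     = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$bucket = 'user-photo-originals';                   // placeholder
$prefix = 'users/1234/originals/';                  // placeholder

// Group the user's objects into parts that stay under the size cap.
$parts   = [[]];
$current = 0;
foreach ($s3->getPaginator('ListObjectsV2', ['Bucket' => $bucket, 'Prefix' => $prefix]) as $page) {
    foreach ($page['Contents'] ?? [] as $object) {
        if ($current + $object['Size'] > PART_LIMIT && $current > 0) {
            $parts[] = [];                           // start a new part
            $current = 0;
        }
        $parts[count($parts) - 1][] = $object['Key'];
        $current += $object['Size'];
    }
}

// Build one zip per part, downloading each object to a temp file first.
foreach ($parts as $i => $keys) {
    $zip = new ZipArchive();
    $zip->open('/var/exports/export-part' . ($i + 1) . '.zip', ZipArchive::CREATE);

    $tmpFiles = [];
    foreach ($keys as $key) {
        $tmp = tempnam(sys_get_temp_dir(), 'obj_');
        $s3->getObject(['Bucket' => $bucket, 'Key' => $key, 'SaveAs' => $tmp]);
        $zip->addFile($tmp, basename($key));         // read lazily when the zip closes
        $tmpFiles[] = $tmp;
    }

    $zip->close();                                   // contents are written here
    array_map('unlink', $tmpFiles);                  // safe to clean up afterwards
}
```

The part numbers and what has been built so far could be logged to the database record mentioned above, so an interrupted export can resume from the last completed part.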