AWS S3 file upload via PHP and HTTP

I want to use Amazon's S3 service to store the files that users upload to my LAMP application. I'm wondering what the best way (time, cost, security, etc.) to do this would be. I'm familiar with uploading files over HTTP with PHP processing, but I've always saved them to local storage. Should I have a tmp directory that the SDK uploads from, or could I even upload the file to S3 straight from a variable in memory? I would also like to be able to handle 5 GB files, but at the moment I'm only running half a GB of RAM; would this cause any issues while I'm in the alpha of my project? Keep in mind my web server is an EC2 instance. Thanks for the help.

Q1: Uploading from memory without creating a temporary file.
Yes, you can do it. The Amazon SDK has "putObjectFile" and "putObjectString" functions: the first creates an object from a file on disk, the second from a string in memory.
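For example, a minimal sketch assuming the widely used standalone S3 PHP class that provides these methods; the credentials, bucket and key names are placeholders:

```php
<?php
// Sketch only: assumes the standalone S3 PHP class (S3.php) is available.
require_once 'S3.php';

// $awsAccessKey and $awsSecretKey are placeholders for your credentials.
S3::setAuth($awsAccessKey, $awsSecretKey);

// Upload straight from a string held in memory (no temporary file needed).
S3::putObjectString($fileContents, 'my-bucket', 'uploads/photo.jpg');

// Or upload from the temporary file PHP created for the HTTP upload.
S3::putObjectFile($_FILES['photo']['tmp_name'], 'my-bucket', 'uploads/photo.jpg');
```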
Q2: Uploading large files (5 GB).
While you can get a server with 5 GB of memory, it's overkill to hold the entire upload in memory while it happens, so writing to a temporary file and streaming it chunk by chunk from that file would probably be wise. To handle chunks with cURL in PHP, you may need to set CURLOPT_READFUNCTION, which reads a bit of the file at a time during the upload. The PHP documentation describes the callback as follows:
The name of a callback function where the callback function takes three parameters. The first is the cURL resource, the second is a stream resource provided to cURL through the option CURLOPT_INFILE, and the third is the maximum amount of data to be read. The callback function must return a string with a length equal or smaller than the amount of data requested, typically by reading it from the passed stream resource. It should return an empty string to signal EOF.
You can find the cURL calls in the Amazon SDK's getResponse() function. The class is labelled "final", so you'll need to modify the SDK itself to add this in.
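An illustrative, stripped-down sketch of such a read callback (S3 authentication headers and error handling omitted; the URL and file path are placeholders):

```php
<?php
// Stream a local temporary file in small chunks so the whole upload
// never sits in memory at once. Illustrative only.
$file = '/tmp/upload.bin';
$fp   = fopen($file, 'rb');

$ch = curl_init('https://my-bucket.s3.amazonaws.com/large-object');
curl_setopt($ch, CURLOPT_PUT, true);
curl_setopt($ch, CURLOPT_INFILE, $fp);
curl_setopt($ch, CURLOPT_INFILESIZE, filesize($file));
curl_setopt($ch, CURLOPT_READFUNCTION, function ($ch, $fp, $length) {
    // Return at most $length bytes; an empty string signals EOF.
    $data = fread($fp, $length);
    return ($data === false) ? '' : $data;
});
curl_exec($ch);
curl_close($ch);
fclose($fp);
```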
Q3: Costs. A server with a slightly larger hard disk (to store temp files) is most likely cheaper than adding memory.
Q4: Security. You can store your temp files outside the web root, so they will be as secure as your web server is. If your web server gets compromised, the attacker gets your Amazon Secret Key anyway, so this shouldn't be any more of a concern than protecting the rest of your application.

Related

How do I send a file upload from an HTML form to S3 via PHP without local storage?

I'm trying to convert a website to use S3 storage instead of local (expensive) disk storage. I solved the download problem using a stream wrapper interface on the S3Client. The upload problem is harder.
It seems to me that when I post to a PHP endpoint, the $_FILES object is already populated and copied to /tmp/ before I can even intercept it!
On top of that, the S3Client->upload() expects a file on the disk already!
Seems like a double-whammy against what I'm trying to do, and most advice I've found uses NodeJS or Java streaming so I don't know how to translate.
It would be better if I could intercept the code that populates $_FILES and then send up 5MB chunks from memory with the S3\ObjectUploader, but how do you crack open the PHP multipart handler?
Thoughts?
EDIT: It is a very low quantity of files, 0-20 per day, mostly 1-5 MB, sometimes hitting 40-70 MB. Periodically (once every few weeks) a 1-2 GB file will be uploaded. Hence the desire to move off an EC2 instance and onto a Heroku/Beanstalk-type PaaS where I won't have much /tmp/ space.
It's hard to comment on your specific situation without knowing the performance requirements of the application and the volume of users that need to access it, so I'll try to answer assuming a basic web app uploading profile avatars.
There are good reasons for this behaviour: the file is streamed to disk for multiple purposes, one of which is to conserve memory. If your file is not on disk then it is in memory (think disk usage is expensive? Bump up your memory usage and see how expensive that gets), which is fine for a single user uploading a small file, but not so great for a bunch of users uploading small files, or worse, large files. You'll likely see the best performance if you use the defaults in these libraries and let them stream to and from the disk.
But again I don't know your use case and you may actually need to avoid the disk at all costs for some unknown reason.
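For illustration, a minimal sketch of letting the library stream from the already-written temporary file, using the AWS SDK for PHP v3 ObjectUploader mentioned in the question; the form field name, bucket and key are placeholders:

```php
<?php
// Sketch only: streams the uploaded temp file to S3 in multipart chunks
// rather than loading it into memory.
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\S3\ObjectUploader;

$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);

$source   = fopen($_FILES['upload']['tmp_name'], 'rb');
$uploader = new ObjectUploader(
    $s3,
    'my-bucket',
    'uploads/' . basename($_FILES['upload']['name']),
    $source
);
$result = $uploader->upload();
fclose($source);
```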

S3 or EFS - Which one is best to create dynamic image from set of images?

I have a quiz site that creates images from a set of source images. The result images are stored in S3, and I'm not worried about those. My question is about the source images: is S3 or EFS better for storing them for this purpose? I am using PHP to create the result images.
Here's a general rule for you: Always use Amazon S3 unless you have a reason to do otherwise.
Why?
It has unlimited storage
The data is replicated for resilience
It is accessible from anywhere (given the right permissions)
It has various cost options
It can be accessed by AWS Lambda functions
The alternative is a local disk (EBS) or a shared file system (EFS). They are more expensive, can only be accessed from EC2 and take some amount of management. However, they have the benefit that they act as a directly-attached storage device, so your code can reference it directly without having to upload/download.
So, if your code needs the files locally, then EFS would be a better choice. But if your code can handle S3 (download from it, use the files, upload the results), then S3 is a better option.
Given that your source images will (presumably) be at a higher resolution than those you are creating, and that once processed they will not need to be accessed regularly afterwards (again, presumably), I would suggest that the lower cost of S3 and the archiving options available there mean it would be best for you. There's a more in-depth answer here:
AWS EFS vs EBS vs S3 (differences & when to use?)
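As a rough illustration of the "download from it, use the files, upload the results" flow mentioned above, assuming the AWS SDK for PHP v3 and GD; the bucket names, keys and paths are placeholders:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);

// Pull a source image down to local disk...
$s3->getObject([
    'Bucket' => 'quiz-source-images',
    'Key'    => 'backgrounds/template.png',
    'SaveAs' => '/tmp/template.png',
]);

// ...compose the result image with GD...
$img = imagecreatefrompng('/tmp/template.png');
// ... draw the quiz-specific parts onto $img ...
imagepng($img, '/tmp/result.png');

// ...and push the finished image back to S3.
$s3->putObject([
    'Bucket'     => 'quiz-result-images',
    'Key'        => 'results/quiz-42.png',
    'SourceFile' => '/tmp/result.png',
]);
```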

File processing on load balancing server (Cluster)

I have a PHP application on a clustered server. It copies a file from an AWS bucket to the server, processes the file (unzips it, converts the PDF to XML using iText for Java, reads the XML and saves the data to the database), and then uploads the processed file back to the bucket.
It works fine for a single instance, but with load balancing across multiple instances the file being processed on the server disappears.
I cannot process the file directly from the bucket, since I can't unzip it there or run a JAR file against it, so I have to store the file temporarily for processing. Is there any way to handle this situation?
A few possible solutions:
Use a single central key-value store (database) to record the paths of the files you are currently processing; before downloading a new file, check whether it isn't already being processed. You could use Redis for this (see the sketch after this list)
Upload a new, empty file to S3, with something in the file name so you know that if that marker is present, the accompanying file is already being processed (though I'm not sure whether S3 caches directory listings). With this solution you should also consider the cost of writing a file to S3, which also depends on your scale
Rename or remove the file from S3 while it's being processed
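As a rough sketch of the Redis option, assuming the phpredis extension; the host, key prefix and S3 key are placeholders:

```php
<?php
// Use a central Redis instance as a claim/lock so only one instance
// processes a given S3 key. Illustrative only.
$redis = new Redis();
$redis->connect('redis.internal', 6379);

$s3Key = 'incoming/archive-123.zip';

// SET NX succeeds only for the first instance that claims the key;
// the TTL guards against a crashed worker holding the claim forever.
if ($redis->set('claim:' . $s3Key, gethostname(), ['nx', 'ex' => 3600])) {
    // ... download from S3, unzip, convert, save to the database ...
    // ... upload the processed file, then release the claim:
    $redis->del('claim:' . $s3Key);
} else {
    // Another instance is already processing this file; skip it.
}
```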
There can be multiple solutions to this:
One solution is to use tags: when a file has been processed, apply a tag such as processed=true at that time, and when you are downloading files, check for the tag first.
A better solution is to use Lambda for this task.
You can use the pattern of
S3 to lambda
Lambda drops a message in SQS
Application monitors SQS
Application processes file
Delete message.
Or just have Lambda do all the work on S3 upload, depending on how long the process runs; the maximum execution time is 5 minutes.
http://docs.aws.amazon.com/lambda/latest/dg/limits.html
For example:
Set up a Lambda function to monitor the S3 bucket for new-object upload events. Have the Lambda function drop a message in SQS (from the event data it receives, the Lambda function knows the source bucket name and object key name). The server can monitor the queue, process the message, extract the file, upload it to a new bucket, delete the file from the old S3 bucket and then delete the message from the queue. If the server dies during processing, the message goes back onto the queue (visibility timeout). A way to ensure a file is processed and deleted from the old bucket is to enable versioning and a lifecycle policy: when processing the message, if the file doesn't exist in the old bucket, send an alert and/or check for the previous version. You can also have a lifecycle policy on the old bucket to permanently delete versions older than X days.
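For illustration, a rough worker loop for that pattern, assuming the AWS SDK for PHP v3; the queue URL, region and paths are placeholders:

```php
<?php
require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;
use Aws\S3\S3Client;

$sqs = new SqsClient(['region' => 'us-east-1', 'version' => 'latest']);
$s3  = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/uploads';

while (true) {
    $result = $sqs->receiveMessage([
        'QueueUrl'          => $queueUrl,
        'WaitTimeSeconds'   => 20,   // long polling
        'VisibilityTimeout' => 300,  // time allowed for processing
    ]);

    foreach ($result->get('Messages') ?? [] as $message) {
        // The Lambda function forwards the S3 event, so the message body
        // contains the source bucket and object key.
        $event  = json_decode($message['Body'], true);
        $record = $event['Records'][0]['s3'];
        $bucket = $record['bucket']['name'];
        $key    = $record['object']['key'];

        // Download to local disk, then unzip / convert / save to database...
        $s3->getObject([
            'Bucket' => $bucket,
            'Key'    => $key,
            'SaveAs' => '/tmp/' . basename($key),
        ]);
        // ... processing and upload of the result go here ...

        // Only delete the message once processing has succeeded.
        $sqs->deleteMessage([
            'QueueUrl'      => $queueUrl,
            'ReceiptHandle' => $message['ReceiptHandle'],
        ]);
    }
}
```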
Monitoring S3 with Lambda
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
S3 Versioning
http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
Select Permanently delete previous versions and then enter the number of days after an object becomes a previous version to permanently delete the object (for example, 455 days).
http://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
What you need is a system that will store the file's state without losing it. There are several alternatives for that:
a) Another server
b) An SQS Queue.
#strongiz's answer above explains it very well.
c) Even another database.
In each of these cases, you need a flag which defines whether the file has been processed or not. When file processing is complete, either:
a) delete the file, or
b) change the flag.
Since PHP is session-oriented, you can't store data there permanently, so you need to connect to another system. In the case of a database, you can store the file path and a flag indicating whether the file has been processed. So even a combination of the three might work.
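A rough sketch of the database-flag idea using PDO with MySQL; the table, columns and connection details are invented for illustration:

```php
<?php
// Assumes a table with a UNIQUE key on file_path, e.g.:
//   CREATE TABLE processed_files (file_path VARCHAR(512) PRIMARY KEY,
//                                 processed TINYINT NOT NULL DEFAULT 0);
$pdo   = new PDO('mysql:host=db.internal;dbname=jobs', $user, $pass);
$s3Key = 'incoming/archive-123.zip';

// Claim the file: the unique key makes the INSERT a no-op for any other
// instance that tries to claim the same file.
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO processed_files (file_path, processed) VALUES (?, 0)'
);
$stmt->execute([$s3Key]);

if ($stmt->rowCount() === 1) {
    // We won the claim: download, process, upload the result...
    $pdo->prepare('UPDATE processed_files SET processed = 1 WHERE file_path = ?')
        ->execute([$s3Key]);
}
```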

Serving large file downloads from remote server

We have files that are hosted on RapidShare which we would like to serve through our own website. Basically, when a user requests http://site.com/download.php?file=whatever.txt, the script should stream the file from RapidShare to the user.
The only thing I'm having trouble getting my head around is how to properly stream it. I'd like to use cURL, but I'm not sure if I can read the download from RapidShare in chunks and then echo them to the user. The best way I've thought of so far is to use a combination of fopen, fread, echo'ing the chunk of the file to the user, flushing, and repeating that process until the entire file is transferred.
I'm aware of the PHP readfile() function as well, but would that be the best option? Bear in mind that these files can be several GB in size, and although we have servers with 16 GB of RAM I want to keep memory usage as low as possible.
Thank you for any advice.
HTTP has a header called "Range" which basically allows you to fetch any chunk of a file (provided you already know the file size), but since PHP isn't multi-thread aware, I don't see much benefit in using it.
As far as I know, if you don't want to consume all your RAM, the only way to go is a two-step approach.
First, stream the remote file using fopen()/fread() (or any PHP functions that let you work with streams), reading it in small chunks (2048 bytes may be enough) and writing/appending each chunk to a tempfile(); then echo it back to your user by reading the temporary file the same way.
That way, even a 2 TB file would basically consume 2048 bytes of memory at a time, since only the current chunk and the file handle are held in memory.
You may also write some kind of proxy manager to cache and keep already downloaded files to avoid the remote reading process if a file is heavily downloaded (and keep it locally for a given time).
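A minimal sketch of the two-step approach described above, assuming allow_url_fopen is enabled; the remote URL and filename are placeholders:

```php
<?php
// Only one small chunk is held in memory at any time, regardless of size.
$remote = fopen('https://remote.example.com/files/whatever.txt', 'rb');
$tmp    = tmpfile();

// Step 1: stream the remote file to a temporary file in small chunks.
while (!feof($remote)) {
    fwrite($tmp, fread($remote, 8192));
}
fclose($remote);

// Step 2: echo the temporary file back to the user, again chunk by chunk.
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="whatever.txt"');

rewind($tmp);
while (!feof($tmp)) {
    echo fread($tmp, 8192);
    flush(); // push the chunk to the client immediately
}
fclose($tmp); // tmpfile() handles are removed automatically on close
```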

Is there a way to allow users of my site to download large volumes of image files from Amazon S3 via Flash / PHP / other service?

My website allows users to upload photographs which I store on Amazon's S3. I store the original upload as well as an optimized image and a thumbnail. I want to allow users to be able to export all of their original versions when their subscription expires. So I am thinking the following problems arise
Could be a large volume of data (possibly around 10GB)
How to manage the download process - eg make sure if it gets interrupted where to start from again, how to verify successful download of files
Should this be done with individual files or try and zip the files and download as one file or a series of smaller zipped files.
Are there any tools out there that I can use for this? I have seen Fzip which is an Actionscript library for handling zip files. I have an EC2 instance running that handles file uploads so could use this for downloads also - eg copy files to EC2 from S3, Zip them then download them to user via Flash downloader, use Fzip to uncompress the zip folder to user's hard drive.
Has anyone come across a similar service / solution?
all input appreciated
thanks
I have not dealt with this problem directly but my initial thoughts are:
Flash or possibly jQuery could be leveraged for a homegrown solution, having the client send back information on what it has received and storing that information in a database log. You might also consider using Bit Torrent as a mediator, your users could download a free torrent client and you could investigate a server-side torrent service (maybe RivetTracker or PHPBTTracker). I'm not sure how detailed these get, but at the very least, since you are assured you are dealing with a single user, if they become a seeder you can wipe the old file and begin on the next.
Break files larger than 2 GB into 2 GB chunks to accommodate users with FAT32 drives, which can't handle files larger than ~4 GB. Break them down to 1 GB chunks if space on the server is limited, keeping a record in the database of what's been zipped from S3.
Fzip is cool but I think it's more for client side archiving. PHP has ZIP and RAR libraries (http://php.net/manual/en/book.zip.php) you can use to round up files server-side. I think any solution you find will require you to manage security on your own by keeping records in a database of who's got what and download keys. Not doing so may lead to people leeching your resources as a file delivery system.
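To sketch the server-side zipping idea, here is a rough example assuming the AWS SDK for PHP v3 and PHP's ZipArchive; the bucket, key list and paths are placeholders:

```php
<?php
// Pull the user's originals from S3 onto the EC2 instance and round them
// up into one archive with ZipArchive. Illustrative only.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3   = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$keys = ['originals/img001.jpg', 'originals/img002.jpg'];

$zip = new ZipArchive();
$zip->open('/tmp/export-user123.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE);

foreach ($keys as $key) {
    $local = '/tmp/' . basename($key);
    $s3->getObject(['Bucket' => 'my-bucket', 'Key' => $key, 'SaveAs' => $local]);
    $zip->addFile($local, basename($key));
}
$zip->close();

// The finished archive can now be served to the user (or split into
// ~1-2 GB parts first, as suggested above).
```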
Good luck!
