Amazon PHP SDK (http://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/s3-transfer.html): sync operations are performed using the methods linked above. I've successfully coded and configured them, but on each call the last modified date on the files in the bucket gets updated to the latest time, even though the files have not been modified locally since the previous sync call.
I wonder whether it is a sync operation at all, or just an overwrite of whatever is sent from the local directory?
Why this matters: we are planning to sync gigabytes of files between a server and an S3 bucket, using S3 buckets as backup storage. In case of any disruption, we can sync the opposite way (S3 bucket -> server) to make the missing pieces of data available on our server again.
Notes:
I've also tried this one from here
Currently I'm using version 3 of the AWS php sdk
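For reference, this is roughly how I'm invoking the transfer, following the linked v3 guide (a minimal sketch; the region, bucket name and paths are placeholders):

require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\S3\Transfer;

// Credentials are picked up from the environment/instance profile.
$client = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1', // placeholder region
]);

// Upload everything under /local/backup to s3://my-backup-bucket/backup
$manager = new Transfer($client, '/local/backup', 's3://my-backup-bucket/backup');
$manager->transfer();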
Unfortunately, I believe the answer is no; I also see a complete upload of every file when using the Transfer class.
It used to work, from the v2 API docs:
The uploadDirectory() method of a client will compare the contents of the local directory to the contents in the Amazon S3 bucket and only transfer files that have changed.
Perfect, that's what we want!
However, in v3 they have preserved S3Client::uploadDirectory(), for API compatibility I guess, but it's just a wrapper for Transfer::promise(), which, as we know, just uploads without any syncing. Keeping API compatibility while changing the behavior like this seems like a really bad idea to me.
I ended up having to add support to my project to use the AWS CLI tool for the actual uploading, which does support sync perfectly. Far from ideal.
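Roughly, that boils down to shelling out to the CLI from PHP, something like this (a minimal sketch; the bucket name and paths are placeholders, and it assumes the aws binary is installed and configured on the host):

// Sync a local directory up to S3; only new/changed files are transferred.
$source = '/local/backup';
$dest   = 's3://my-backup-bucket/backup'; // placeholder bucket

exec(sprintf('aws s3 sync %s %s', escapeshellarg($source), escapeshellarg($dest)), $output, $exitCode);

if ($exitCode !== 0) {
    throw new RuntimeException("aws s3 sync failed:\n" . implode("\n", $output));
}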
If there is a way to use the Transfer class for easy syncing instead of complete uploading, I hope someone can prove me wrong.
Related
I'm developing an application that will run in an elastic environment on AWS (EC2 instances with autoscaling). The whole app is being developed in PHP.
The core of the app is based on safely storing files in an S3 bucket. Since the user doesn't need to know where the file was saved, I thought I could store it temporarily on the EC2 instance and then asynchronously move it to S3, using a job queue (Amazon SQS) to avoid duplicating the wait time and to have better handling of S3 problems (they aren't common, but can happen).
My questions are:
Does this approach sound good, or am I missing something?
When processing the job from the queue, will the worker instance have to connect to the original EC2 instance, retrieve the file from it and then upload it to S3?
How can I avoid problems with autoscaling? An instance could be deleted before I store the file in the S3 bucket.
Ideally, you don't want your main app server tied up during file uploads (both to the app server and subsequently to S3).
CORS (Cross-Origin Resource Sharing) exists to avoid precisely this. You can upload the file to S3 directly from the client side and let Amazon worry about handling multiple uploads from your concurrent users. It lets your app do what it does best without having to worry about the uploads themselves.
This SO question discusses the same issue, and there are several customisable plugins like Fine Uploader out there which can wrap around this with progress bars, etc.
This completely removes the need to make use of any kind of queue. If you need to do certain bookkeeping operations after the upload, you could simply make an ajax call to your server after the upload is complete with the file info, etc. It should also address any concerns you might have with instances being removed due to autoscaling since everything is client side.
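As a rough sketch of the server-side piece (assuming the official AWS SDK for PHP v3; the bucket, key and expiry are placeholders), you generate a pre-signed PUT URL and hand it to the client, which then uploads the file straight to S3:

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3Client = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1', // placeholder region
]);

// Pre-sign a PutObject request so the browser can upload directly to S3.
$cmd = $s3Client->getCommand('PutObject', [
    'Bucket' => 'my-upload-bucket',            // placeholder bucket
    'Key'    => 'uploads/' . uniqid() . '.jpg',
]);
$uploadUrl = (string) $s3Client->createPresignedRequest($cmd, '+15 minutes')->getUri();

// Return $uploadUrl to the client-side code, which PUTs the file to it.

Note that the bucket also needs a CORS configuration that allows PUT requests from your site's origin.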
I can't seem to find sync and copy options in the php sdk for amazon-s3. Currently I am using aws s3 cp/sync CLI command from within PHP code to achieve this which doesn't seem very neat to me. Any ideas?
You can do this with a combination of two features in the PHP SDK.
First, check out the S3Client::uploadDirectory() method (see the user guide). You can upload a whole directory's contents like this:
$s3Client->uploadDirectory('/local/directory', 'your-bucket-name');
If you combine this with the S3 Stream Wrapper, then instead of using a local directory as the source, you can specify a bucket.
$s3Client->registerStreamWrapper();
$s3Client->uploadDirectory('s3://your-other-bucket-name', 'your-bucket-name');
This is pretty powerful, because you can even copy across regions if you use a differently configured S3Client object for the Stream Wrapper than the one doing the uploadDirectory().
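For example, a rough sketch of the cross-region variant, assuming the version 2 SDK this answer is based on, where uploadDirectory() itself walks the source via the stream wrapper (bucket names and regions are placeholders):

require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Client for the source bucket's region; its stream wrapper handles the s3:// reads.
$sourceClient = S3Client::factory(['region' => 'us-east-1']);
$sourceClient->registerStreamWrapper();

// Client for the destination bucket's region; it performs the actual uploads.
$destClient = S3Client::factory(['region' => 'eu-west-1']);
$destClient->uploadDirectory('s3://your-other-bucket-name', 'your-bucket-name');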
I'm working on a project that is being hosted on Amazon Web Services. The server setup consists of two EC2 instances, one Elastic Load Balancer and an extra Elastic Block Store on which the web application resides. The project is supposed to use S3 for storage of files that users upload. For the sake of this question, I'll call the S3 bucket static.example.com
I have tried using s3fs (https://code.google.com/p/s3fs/wiki/FuseOverAmazon), RioFS (https://github.com/skoobe/riofs) and s3ql (https://code.google.com/p/s3ql/). s3fs will mount the filesystem but won't let me write to the bucket (I asked this question on SO: How can I mount an S3 volume with proper permissions using FUSE). RioFS will mount the filesystem and will let me write to the bucket from the shell, but files that are saved using PHP don't appear in the bucket (I opened an issue with the project on GitHub). s3ql will mount the bucket, but none of the files that are already in the bucket appear in the filesystem.
These are the mount commands I used:
s3fs static.example.com -ouse_cache=/tmp,allow_other /mnt/static.example.com
riofs -o allow_other http://s3.amazonaws.com static.example.com /mnt/static.example.com
s3ql mount.s3ql s3://static.example.com /mnt/static.example.com
I've also tried using this S3 class: https://github.com/tpyo/amazon-s3-php-class/ and this FuelPHP specific S3 package: https://github.com/tomschlick/fuel-s3. I was able to get the FuelPHP package to list the available buckets and files, but saving files to the bucket failed (but did not error).
Have you ever mounted an S3 bucket on a local Linux filesystem and used PHP to write a file to the bucket successfully? What tool(s) did you use? If you used one of the above mentioned tools, what version did you use?
EDIT
I have been informed that the issue I opened with RioFS on GitHub has been resolved. Although I decided to use the S3 REST API rather than attempting to mount a bucket as a volume, it seems that RioFS may be a viable option these days.
Have you ever mounted an S3 bucket on a local Linux filesystem?
No. It's fun for testing, but I wouldn't let it near a production system. It's much better to use a library to communicate with S3. Here's why:
It won't hide errors. A filesystem only has a few error codes it can send you to indicate a problem. An S3 library will give you the exact error message from Amazon so you understand what's going on, log it, handle corner cases, etc.
A library will use less memory. Filesystem layers will cache lots of random stuff that you may never use again. A library puts you in control of deciding what to cache and what not to cache.
Expansion. If you ever need to do anything fancy (set an ACL on a file, generate a signed link, versioning, lifecycle, change durability, etc), then you'll have to dump your filesystem abstraction and use a library anyway.
Timing and retries. Some fraction of requests randomly error out and can be retried. Sometimes you may want to retry a lot, sometimes you would rather error out quickly. A filesystem doesn't give you granular control, but a library will.
The bottom line is that S3 under FUSE is a leaky abstraction. S3 doesn't have (or need) directories. Filesystems weren't built for billions of files. Their permissions models are incompatible. You are wasting a lot of the power of S3 by trying to shoehorn it into a filesystem.
Two random PHP libraries for talking to S3:
https://github.com/KnpLabs/Gaufrette
https://aws.amazon.com/sdkforphp/ - this one is useful if you expand beyond just using S3, or if you need to do any of the fancy requests mentioned above.
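To illustrate the kind of "fancy" requests mentioned above, a brief sketch with the AWS SDK for PHP (the bucket, key, file and retry count are placeholders):

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1', // placeholder region
    'retries' => 5,           // you decide how aggressively to retry
]);

// Upload a file with an explicit ACL and storage class (durability/cost trade-off).
$s3->putObject([
    'Bucket'       => 'my-bucket',          // placeholder bucket
    'Key'          => 'reports/latest.pdf', // placeholder key
    'SourceFile'   => '/tmp/latest.pdf',
    'ACL'          => 'private',
    'StorageClass' => 'REDUCED_REDUNDANCY',
]);

None of this is easy to control through a mounted filesystem.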
Quite often, it is advantageous to write files to the EBS volume, then force subsequent public requests for the file(s) to route through CloudFront CDN.
In that way, if the app must do any transformations to the file, it's much easier to do on the local drive & system, then force requests for the transformed files to pull from the origin via CloudFront.
E.g., if your user is uploading an image for an avatar, and the avatar image needs several iterations for size & crop, your app can create these on the local volume, but all public requests for the file will take place through a CloudFront origin-pull request. In that way, you have maximum flexibility to keep the original file (or an optimized version of the file), and any subsequent user requests can either pull an existing version from the CloudFront edge, or CloudFront will route the request back to the app and create any necessary iterations.
An elementary example of the above would be WordPress, which creates multiple sized/cropped versions of any graphic image uploaded, in addition to keeping the original (subject to file size restrictions, and/or plugin transformations). CDN-capable WordPress plugins such as W3 Total Cache rewrite requests to pull through CDN, so the app only needs to create unique first-request iterations. Adding browser caching URL versioning (http://domain.tld/file.php?x123) further refines and leverages CDN functionality.
If you are concerned about rapid expansion of EBS volume file size or inodes, you can automate a pruning process for seldom-requested files, or aged files.
I'm teaching myself JavaScript and PHP by building an app, and I decided I would like to use Amazon EC2 and S3 as the platform. My question is about using S3 as a "database", but I'll start with a bit of background.
The app uses this class to interact with S3 buckets: http://undesigned.org.za/2007/10/22/amazon-s3-php-class/documentation#getObject
When a user logs into the app, the app will download a file from an S3 bucket; every user has their own file. It then brings the data client side as JSON, and most of the "interaction" is client side (using JavaScript) with a bit of PHP. Once the user is done (probably after 30 minutes or so), the app saves/uploads the data and replaces the S3 file.
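To make the flow concrete, here is a rough sketch of the load/save cycle (shown with the official AWS SDK for PHP rather than the class linked above; the bucket, key layout and user id are placeholders):

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3     = new S3Client(['version' => 'latest', 'region' => 'us-east-1']); // placeholder region
$bucket = 'my-app-userdata';            // placeholder bucket
$key    = 'users/' . $userId . '.json'; // one file per user; $userId comes from the login

// On login: fetch the user's file and decode it for the client side.
$result = $s3->getObject(['Bucket' => $bucket, 'Key' => $key]);
$data   = json_decode((string) $result['Body'], true);

// ... the user works with the data client side for ~30 minutes ...

// When the session ends: overwrite the file with the updated data.
$s3->putObject([
    'Bucket'      => $bucket,
    'Key'         => $key,
    'Body'        => json_encode($data),
    'ContentType' => 'application/json',
]);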
My reasoning behind all of this is that I think the app will be very scalable. My hope is that I can use load balancing, with each instance being able to interact directly with S3. If lots of users log on, I can simply create lots of "micro" or "small" instances to handle them all. One of the drawbacks of EC2 is that if the instance crashes or goes offline, all its data is lost, so my thought is: instead of constantly having to back everything up, why not build the app around S3 in the first place?
My question: Does this make sense? Is there a reason I haven't seen many examples of this kind of thing "in the real world"?
Thank you so much for your time!
Cheers,
I just visited this site, which has a good description of experiences with using S3 as a database:
http://petewarden.typepad.com/searchbrowser/2010/10/how-i-ended-up-using-s3-as-my-database.html
Have you seen Amazon SimpleDB?
Creating your own datastore and storing it on S3 doesn't sound that practical, especially as you have to upload and download a file every 30 minutes; that hardly sounds scalable to me! What if your server goes down or the file gets lost?
You can run MySQL and other databases on Amazon, so why not back those up (say, daily) to S3 instead?
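As a rough sketch of that approach (the database credentials, bucket and paths are placeholders), a nightly cron job could dump the database and push the dump to S3:

require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Dump and compress the database (credentials are placeholders).
$dumpFile = '/tmp/backup-' . date('Y-m-d') . '.sql.gz';
exec(sprintf('mysqldump -u appuser -pPLACEHOLDER appdb | gzip > %s', escapeshellarg($dumpFile)));

// Upload the dump to S3.
$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']); // placeholder region
$s3->putObject([
    'Bucket'     => 'my-db-backups', // placeholder bucket
    'Key'        => basename($dumpFile),
    'SourceFile' => $dumpFile,
]);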
I will be launching an application in the very near future which will, in part, require users to upload files (images) to be viewed by other members. I like the idea of S3 as it is relatively cheap and scales automatically.
My problem is how I will have users upload their images to S3. It seems there are a few options.
1- Use the PHP REST API. The only problem is that I can't get it to work for uploading variously scaled versions (i.e. thumbnails) of the same image simultaneously and uploading them directly to S3 (it works for just one image at a time this way). Overall, it just seems less flexible.
http://net.tutsplus.com/tutorials/php/how-to-use-amazon-s3-php-to-dynamically-store-and-manage-files-with-ease/
2- The other option would be to mount an S3 bucket with s3fs. Then just programmatically move my images into the bucket like I would with NFS. From what I've read, it seems some people are dubious of the reliability of mounting S3. Is this true?
http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=fuse+over+amazon
Which method would be better for maximum reliability and speed?
Would EBS be something to consider? I would really like to have a dedicated box rather than use an EC2 instance, though...
For your use case I recommend using the S3 API directly rather than s3fs, because of performance. Remember that s3fs is just another layer on top of S3's API, and its usage of that API is not always the best one for your application.
To handle the creation of thumbnails, I recommend decoupling that from the main upload process by using Amazon Simple Queue Service. That way your users will receive a response as soon as a file is uploaded, without having to wait for it to be processed, resulting in shorter response times.
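A rough sketch of that decoupling (the queue URL, bucket, key and message fields are all assumptions): the web request stores the original and enqueues a job, and a separate worker picks it up and generates the thumbnails later.

require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;

$sqs = new SqsClient(['version' => 'latest', 'region' => 'us-east-1']); // placeholder region

// After the original image is safely in S3, enqueue a thumbnail job and
// respond to the user immediately instead of resizing inline.
$sqs->sendMessage([
    'QueueUrl'    => 'https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs', // placeholder
    'MessageBody' => json_encode([
        'bucket' => 'my-image-bucket',          // placeholder bucket
        'key'    => 'originals/avatar-123.jpg', // placeholder key
        'sizes'  => [64, 128, 256],             // widths the worker should generate
    ]),
]);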
As for using EBS, that is a different scenario. EBS is just persistent storage for an Amazon EC2 instance, and its reliability doesn't compare with S3's.
It's also important to remember that S3 only offers "eventual consistency", as opposed to a physical HDD on your machine or an EBS volume on EC2, so you need to code your app to handle that correctly.