I'm working on an auto-update solution, and I'm using Amazon S3 for distribution.
I would like for this to work like follows:
I upload a file to s3 folder
An automatic PHP script detects that a new file has been added and notifies clients
To do this, I somehow need to list all files in an amazon bucket's folder, and find the one which has been added last.
I've tried $s3->list_objects("mybucket");, but it returns the list of all objects inside the bucket, and I don't see an option to list only files inside the specified folder.
What is the best way to do this using Amazon S3 PHP api?
To do this, I somehow need to list all files in an amazon bucket's folder, and find the one which has been added last.
S3's API isn't really optimized for sorting by last-modified date, so you'd need to call list_objects() and check each object's timestamp, keeping track of the newest one until you reach the end of the list.
An automatic PHP script detects that a new file has been added and notifies clients
You'd need to write a long-running PHP CLI script that starts with:
while (true) { /*...*/ }
Maybe throw an occasional sleep(1) in there so that your CPU doesn't spike so badly, but you essentially need to sleep-and-poll, looping over all of the timestamps each time.
I've tried $s3->list_objects("mybucket");, but it returns the list of all objects inside the bucket, and I don't see an option to list only files inside the specified folder.
You'll want to set the prefix parameter in your list_objects() call.
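Putting the two together, here's a minimal sketch of the sleep-and-poll approach using the v3 SDK's listObjectsV2 call with a Prefix (the older SDK's list_objects() takes an options array with a 'prefix' key instead); the bucket name, prefix and notification hook below are placeholders:

    <?php
    require 'vendor/autoload.php';

    use Aws\S3\S3Client;

    $s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

    $lastSeen = null;

    while (true) {
        // List only the keys under the "folder" by using the Prefix parameter.
        // (For more than 1000 keys you'd use the ListObjectsV2 paginator.)
        $result = $s3->listObjectsV2([
            'Bucket' => 'mybucket',   // placeholder
            'Prefix' => 'updates/',   // placeholder "folder"
        ]);

        $newest = null;
        foreach ($result['Contents'] ?? [] as $object) {
            if ($newest === null || $object['LastModified'] > $newest['LastModified']) {
                $newest = $object;
            }
        }

        if ($newest !== null && $newest['Key'] !== $lastSeen) {
            $lastSeen = $newest['Key'];
            // notifyClients($lastSeen); // hypothetical notification hook
        }

        sleep(1); // sleep-and-poll
    }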
S3 has launched versioning for the files in a bucket: http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html.
You could get the latest n files by calling s3client.listVersions(request) and specifying n if you want. See http://docs.aws.amazon.com/AmazonS3/latest/dev/list-obj-version-enabled-bucket.html
That example is in Java; I'm not sure whether the versioning API has been added to the PHP SDK.
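For what it's worth, the v3 PHP SDK does expose version listing too; a minimal sketch (assuming an already-configured client, with bucket and prefix as placeholders):

    <?php
    // $s3 is an already-configured Aws\S3\S3Client instance.
    $result = $s3->listObjectVersions([
        'Bucket'  => 'mybucket',  // placeholder
        'Prefix'  => 'updates/',  // placeholder
        'MaxKeys' => 10,          // only the first page of versions
    ]);

    foreach ($result['Versions'] ?? [] as $version) {
        echo $version['Key'], ' ', $version['VersionId'], PHP_EOL;
    }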
I have a PHP application on a cluster server. It copies a file from an AWS bucket to the server, processes the file (unzips it, converts the PDF to XML using iText in Java, reads the XML and saves the data to a database), and then uploads the processed file back to the bucket.
It works fine for a single instance, but with load balancing across multiple instances, the file being processed on the server disappears.
I can't process the file directly on the bucket, as I can't unzip it there and can't run a jar file against it, so I have to store the file temporarily for processing. Is there any way to handle this situation?
A few possible solutions:
Use a single central key-value store (database) to record the paths of the files that are currently being processed; before downloading a new file, check whether it isn't already being processed. You could use Redis for this (see the sketch after this list).
Upload a new, empty file to S3, with something in the file name so you know that if that marker file is present, the accompanying file is already being processed (though I'm not sure whether S3 caches directory listings). With this solution you should also consider the cost of writing a file to S3, which depends on your scale.
Rename or remove the file from S3 while it's being processed
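For the first option, a minimal sketch of such a check using the phpredis extension could look like this (the key prefix and timeout are arbitrary assumptions):

    <?php
    // Claim a file for processing; only one instance wins the claim.
    function claimFile(Redis $redis, string $key): bool
    {
        // SET ... NX EX: succeeds only if the key does not exist yet,
        // and expires automatically in case the worker crashes.
        return (bool) $redis->set('processing:' . $key, gethostname(), ['nx', 'ex' => 3600]);
    }

    $redis = new Redis();
    $redis->connect('127.0.0.1');

    if (claimFile($redis, 'uploads/report.zip')) {
        // download from S3, unzip, convert, upload back...
        $redis->del('processing:uploads/report.zip'); // release the claim when done
    } else {
        // another instance is already processing this file
    }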
There can be multiple solutions to this:
One solution is to use tags: when the processed file is uploaded, apply a tag like processed=true, and when you download files, check for that tag.
A better solution is to use Lambda for this task.
You can use the pattern of
S3 event triggers Lambda
Lambda drops a message in SQS
Application monitors SQS
Application processes the file
Application deletes the message
Or just have Lambda do all the work on S3 upload, depending on how long your process runs; the Lambda execution time limit is 5 minutes:
http://docs.aws.amazon.com/lambda/latest/dg/limits.html
For example:
Set up a Lambda function to monitor the S3 bucket for the new-object-upload event. Then have the Lambda function drop a message in SQS (from the event data it receives, the Lambda function knows the source bucket name and object key). The server can monitor the queue, process a message, extract the file, upload it to a new bucket, delete the file from the old S3 bucket, and then delete the message from the queue. If the server dies during processing, the message goes back onto the queue (visibility timeout). A way to ensure the file is processed and deleted from the old bucket is to enable versioning and a lifecycle policy. When processing a message, if the file doesn't exist in the old bucket, send an alert and/or check for a previous version. You can also have a lifecycle policy on the old bucket to permanently delete versions older than X days.
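A minimal sketch of the worker side of that pattern with the v3 PHP SDK (the queue URL and message handling are placeholders; the S3-to-Lambda-to-SQS wiring is configured in AWS itself):

    <?php
    require 'vendor/autoload.php';

    use Aws\Sqs\SqsClient;

    $sqs = new SqsClient(['version' => 'latest', 'region' => 'us-east-1']);
    $queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/new-uploads'; // placeholder

    while (true) {
        // Long polling: wait up to 20 seconds for a message.
        $result = $sqs->receiveMessage([
            'QueueUrl'        => $queueUrl,
            'WaitTimeSeconds' => 20,
        ]);

        foreach ($result['Messages'] ?? [] as $message) {
            $body = json_decode($message['Body'], true);
            // ...read the bucket/key out of $body, download, process, upload...

            // Delete only after successful processing; otherwise the message
            // becomes visible again after the visibility timeout.
            $sqs->deleteMessage([
                'QueueUrl'      => $queueUrl,
                'ReceiptHandle' => $message['ReceiptHandle'],
            ]);
        }
    }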
Monitoring S3 with Lambda
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
S3 Versioning
http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
Select Permanently delete previous versions and then enter the number of days after an object becomes a previous version to permanently delete the object (for example, 455 days).
http://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
What you need is a system that will store the file without loss. There are many alternatives for that:
a) Another server
b) An SQS Queue.
#strongiz's answer above explains it very well.
c) Even another database.
In each of these cases, you need a flag that defines whether the file has been processed or not. When file processing is complete:
a) delete the file, or
b) change the flag
Since PHP is session oriented, you can't store data there permanently, so you need to connect to another interface. In the case of a database, you can store the file path and a flag that indicates whether the file has been processed (see the sketch below). A combination of the three might also work.
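As a rough illustration of the database-flag variant (the table and column names are made up, and the INSERT IGNORE assumes MySQL):

    <?php
    // Assumes a PDO connection in $pdo and a table such as:
    //   CREATE TABLE files (path VARCHAR(255) PRIMARY KEY, processed TINYINT DEFAULT 0);

    // Try to claim the file; the primary key makes the insert a no-op if it is already claimed.
    function claimPath(PDO $pdo, string $path): bool
    {
        $stmt = $pdo->prepare('INSERT IGNORE INTO files (path, processed) VALUES (?, 0)');
        $stmt->execute([$path]);
        return $stmt->rowCount() === 1;
    }

    // Flip the flag once processing is done.
    function markProcessed(PDO $pdo, string $path): void
    {
        $stmt = $pdo->prepare('UPDATE files SET processed = 1 WHERE path = ?');
        $stmt->execute([$path]);
    }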
I want to get the size of a folder in Google Cloud. I know I can do this by retrieving all the files, looping through them and summing their sizes, but that is a time-consuming operation. I want to do this through an API.
Is there any function for this?
No, there is not. You'd either need to compute it when needed or keep a running tally via notifications.
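If you do compute it on demand, a minimal sketch with the google/cloud-storage PHP client might look like this (bucket name and prefix are placeholders):

    <?php
    require 'vendor/autoload.php';

    use Google\Cloud\Storage\StorageClient;

    $storage = new StorageClient();
    $bucket  = $storage->bucket('my-bucket'); // placeholder

    $total = 0;
    // A "folder" is just a key prefix; list the objects under it and sum their sizes.
    foreach ($bucket->objects(['prefix' => 'my-folder/']) as $object) {
        $total += (int) $object->info()['size'];
    }

    echo "Total size: {$total} bytes\n";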
Sync operations with the Amazon PHP SDK (http://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/s3-transfer.html) are performed using the methods linked above. I've successfully coded and configured them, but on each call the last-modified date on the files in the bucket gets updated to the latest time, even though the files were not modified locally since the previous sync call.
I wonder whether it is a sync operation at all, or just an overwrite of whatever is sent from the local directory?
This matters because we are planning to sync gigabytes of files between a server and an S3 bucket. We use S3 buckets as backup storage, so in case of any disruption we can sync the opposite way (S3 bucket -> server) to make the missing pieces of data available on our server.
Notes:
I've also tried this one from here.
Currently I'm using version 3 of the AWS PHP SDK.
Unfortunately I believe the answer is no; I also see a complete upload of every file when using the Transfer class.
It used to work, from the v2 API docs:
The uploadDirectory() method of a client will compare the contents of
the local directory to the contents in the Amazon S3 bucket and only
transfer files that have changed.
Perfect, that's what we want!
However, in v3 they have preserved S3Client::uploadDirectory() for API compatibility I guess, but it's just a wrapper for Transfer::promise(), which as we know just uploads without any syncing. Keeping API compatibility but changing behavior like this seems like a really bad idea to me.
I ended up having to add support to my project to use the AWS CLI tool for the actual uploading, which does support sync perfectly. Far from ideal.
If there is a way to use the Transfer class for easy syncing instead of complete uploading, I hope someone can prove me wrong.
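For anyone taking the same CLI route, a minimal sketch of shelling out to aws s3 sync from PHP (paths and bucket name are placeholders, and the AWS CLI must be installed and configured on the host):

    <?php
    $source = '/var/backups/data';          // placeholder local directory
    $target = 's3://my-backup-bucket/data'; // placeholder bucket/prefix

    $cmd = sprintf(
        'aws s3 sync %s %s 2>&1',
        escapeshellarg($source),
        escapeshellarg($target)
    );

    exec($cmd, $output, $exitCode);

    if ($exitCode !== 0) {
        // $output holds the CLI messages; log or alert as needed.
        error_log('aws s3 sync failed: ' . implode("\n", $output));
    }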
I have a long-running script which gets a (long) array of folders (each with a sub-array of the files in that folder), where I have to perform several actions on each file.
What is the best way to make sure all the actions succeed? And how should I handle unsuccessful actions?
Let's say, what happens if my MySQL server is unavailable, or the Amazon S3 API is not working correctly?
pseudocode of my script:
starting script with folders / files array
looping through each folder
    looping through each file in that folder
        open file (from external server) and try converting it to custom object (only continue if file is a valid "object")
        extract some parts of file and save them to Amazon S3 bucket
        extract some other parts of file and save them to another Amazon S3 bucket
        extract metadata / text of file and insert into elasticsearch
        update mysql record
As mentioned, what you could do is throw and catch Exceptions.
So for instance, if you iterate over the files in a folder using a foreach and do something with those files, then on an error you can throw an Exception and it will stop code execution until it is caught.
So maybe you want to use a logger instead. Since it is 2014, you probably want to use a DIC (dependency injection container) to inject a logger service; otherwise you can just use a singleton (keeping in mind the great flaws that brings) that stores your errors.
So either way you have this service that stores every error.
At the end you just check if it has any errors and then act accordingly.
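A minimal sketch of that idea, collecting errors in a simple service while the loop keeps going ($folders, processFile() and the ErrorCollector class are all made up for illustration; in a real app you'd probably inject a PSR-3 logger instead):

    <?php
    // A tiny error-collecting service.
    class ErrorCollector
    {
        private $errors = [];

        public function add($message)
        {
            $this->errors[] = $message;
        }

        public function hasErrors()
        {
            return count($this->errors) > 0;
        }

        public function all()
        {
            return $this->errors;
        }
    }

    $errors = new ErrorCollector();

    foreach ($folders as $folder) {
        foreach ($folder['files'] as $file) {
            try {
                processFile($file); // your convert/upload/index/update steps
            } catch (Exception $e) {
                // Record the failure and continue with the next file.
                $errors->add($file . ': ' . $e->getMessage());
            }
        }
    }

    if ($errors->hasErrors()) {
        // retry, alert, or persist $errors->all() somewhere
    }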
I have a bucket on Amazon S3 which contains hundreds of objects.
I have a web page that lists out all these objects and has a download object link in html.
This all works as expected and I can download each object individually.
How would it be possible to provide a checkbox next to each link, which allowed a group of objects to be selected and then only those objects downloaded?
So to be clear, if I chose items 1, 2, and 7 and clicked a download link, only those objects would be downloaded. This could be a zip file or one file at a time, although I have no idea how this would work.
I am capable of coding this up, but I am struggling to see HOW it would work, so process descriptions are welcome. I could consider Python or Ruby, although the web app is PHP.
I'm afraid this is a hard problem to solve.
S3 does not allow any 'in place' manipulation of files, so you cannot zip them up into a single download. In the browser, you are stuck with downloading one URL at a time. Of course, there's nothing stopping the user from queuing up downloads manually using a download manager, but there is nothing you can do to help with this.
So you are left with a server-side solution. You'll need to download the files from S3 to a server and zip them up before delivering the zip to the client. Unfortunately, depending on the number and size of the files, this will probably take some time, so you need a notification system to let the user know when their file is ready.
Also, unless your server is running on EC2, you might be paying twice for bandwidth charges: S3 to your server, and then your server to the client.
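A rough sketch of the server-side part with the v3 PHP SDK and ZipArchive (bucket name and keys are placeholders; in practice you would run this as a background job and notify the user when the zip is ready):

    <?php
    require 'vendor/autoload.php';

    use Aws\S3\S3Client;

    $s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

    $bucket = 'my-bucket';                                  // placeholder
    $keys   = ['photos/1.jpg', 'docs/2.pdf', 'docs/7.pdf']; // the user's selected objects

    $zipPath = tempnam(sys_get_temp_dir(), 'dl') . '.zip';
    $zip = new ZipArchive();
    $zip->open($zipPath, ZipArchive::CREATE);

    foreach ($keys as $key) {
        // Download each object to a temporary file, then add it to the archive.
        $tmp = tempnam(sys_get_temp_dir(), 's3');
        $s3->getObject([
            'Bucket' => $bucket,
            'Key'    => $key,
            'SaveAs' => $tmp,
        ]);
        $zip->addFile($tmp, basename($key));
    }

    $zip->close();

    // Serve or store $zipPath, then notify the user that it is ready.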