php - local cache for Google Cloud Storage media

We want to store a few TBs of video files on Google Cloud Storage, but our webservers which will serve the videos to visitors are dedicated servers in our datacenter.
To reduce the costs for outbound Google Cloud traffic we want to cache as much video files as possible locally on the webservers. A cronjob will delete the oldest files (atime) in the cache if needed.
When a visitor requests a video file that is not in the local cache, we start downloading it from GCS using the stream wrapper feature of google/cloud, with a combination of fopen, fread and fwrite to write the local cache file. We also use ignore_user_abort to ensure the script finishes even if the user aborts playback. With this solution it is possible to send the video to the client and save it locally at the same time. The next request for this video will then use the local cache file.
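Roughly, that first (non-range) code path looks like the sketch below. Bucket and path names are placeholders, the snippet assumes application default credentials for google/cloud, and it is a simplified illustration rather than the production code.

<?php
// Tee the GCS object to the client and to a local cache file at the same time.
// $bucketName, $objectName and $cachePath are placeholders.
require 'vendor/autoload.php';

use Google\Cloud\Storage\StorageClient;

$bucketName = 'my-video-bucket';
$objectName = 'videos/example.mp4';
$cachePath  = '/var/cache/videos/example.mp4';

$storage = new StorageClient();           // assumes application default credentials
$storage->registerStreamWrapper();        // enables gs:// URLs

ignore_user_abort(true);                  // keep caching even if the player disconnects

$src   = fopen("gs://{$bucketName}/{$objectName}", 'rb');
$cache = fopen($cachePath . '.part', 'wb');

header('Content-Type: video/mp4');

while (!feof($src)) {
    $chunk = fread($src, 1024 * 1024);    // 1 MiB at a time
    fwrite($cache, $chunk);               // always write the cache copy
    if (!connection_aborted()) {
        echo $chunk;                      // ...and stream to the client while connected
        flush();
    }
}

fclose($src);
fclose($cache);
rename($cachePath . '.part', $cachePath); // publish the cache file atomically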
This approach works fine as long as the player does not support seeking, but we also want to provide seeking via HTTP Range headers, which means clients will request only partial content. Seeking to the right position in the video file on GCS works, and seeking to the right position in the locally cached video file also works. What I need is a way to build the locally cached video file step by step, depending on which partial content is requested. As long as a requested range is not available in the local file, we download that partial content from GCS and write it to the local file. We have to mark the successfully cached parts and keep downloading parts from GCS until the local file is complete. But I don't know how to share the information about which parts are available locally between all the individual HTTP requests. I feel that all these separate requests to the same video file, with many different ranges from different clients, sound like trouble... Any thoughts about this?
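One possible way to share that information between requests (just a sketch for illustration, not part of the current setup; file and function names are made up) is a small sidecar metadata file next to each cached video that records the byte ranges already written, updated under flock() so concurrent requests don't clobber each other:

<?php
// Record and query cached byte ranges in a sidecar JSON file guarded by flock(),
// so concurrent range requests can see which parts of the video are already local.
// $cachePath is a placeholder for the partially cached video file.

function markRangeCached(string $cachePath, int $start, int $end): void
{
    $metaPath = $cachePath . '.ranges.json';
    $fh = fopen($metaPath, 'c+');
    flock($fh, LOCK_EX);                              // one writer at a time

    $raw    = stream_get_contents($fh);
    $ranges = $raw !== '' ? (json_decode($raw, true) ?: []) : [];
    $ranges[] = [$start, $end];

    // Merge overlapping or adjacent ranges so the list stays small.
    usort($ranges, fn($a, $b) => $a[0] <=> $b[0]);
    $merged = [];
    foreach ($ranges as $r) {
        $last = end($merged);
        if ($last !== false && $r[0] <= $last[1] + 1) {
            $merged[count($merged) - 1][1] = max($last[1], $r[1]);
        } else {
            $merged[] = $r;
        }
    }

    ftruncate($fh, 0);
    rewind($fh);
    fwrite($fh, json_encode($merged));
    flock($fh, LOCK_UN);
    fclose($fh);
}

function rangeIsCached(string $cachePath, int $start, int $end): bool
{
    $metaPath = $cachePath . '.ranges.json';
    if (!is_file($metaPath)) {
        return false;
    }
    $ranges = json_decode(file_get_contents($metaPath), true) ?: [];
    foreach ($ranges as [$s, $e]) {
        if ($s <= $start && $end <= $e) {
            return true;                              // fully covered by one cached range
        }
    }
    return false;
}

Each request would call rangeIsCached() before deciding whether to read locally or from GCS, and markRangeCached() after writing a downloaded range to the local file; once the merged ranges cover the whole object size, the file can be treated as completely cached.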

Related

Best way to "pipe" file contents from a remote server, via a 2nd server, output to browser

I have numerous storage servers, and then more "cache" servers which are used to load balance downloads. At the moment I use RSYNC to copy the most popular files from the storage boxes to the cache boxes, then update the DB with the new server IDs, so my script can route the download requests to a random box which has the file.
I'm now looking at better ways to distribute the content, and wondering whether it's possible to route requests to any box at random. The download script would then check whether the file exists locally; if it doesn't, it would fetch the file contents from the remote storage box and output them to the browser in real time, while also keeping the file on the cache box, so that the next time the same request is made it can just serve the local copy rather than connecting to the storage box again.
Hope that makes sense(!)
I've been playing around with RSYNC, wget and cURL commands, but I'm struggling to find a way to output the data to browser as it comes in.
I've also been reading up on reverse proxies with nginx, which sounds like the right route... but it still sounds like they require the entire file to be downloaded from the origin server to the cache server before anything can be sent to the client(?). Some of my files are 100GB+ and each server has a 1Gbps bandwidth limit, so at best it would take hundreds of seconds to pull a file of that size onto the cache server before the client sees any data at all. There must be a way to "pipe" the data to the client as it streams?
Is what I'm trying to achieve possible?
You can pipe data without downloading the full file using streams. One example for downloading a file as a stream would be the Guzzle sink feature. One example for uploading a file as a stream would be the Symfony StreamedResponse. Using those the following can be done:
Server A has a file the user wants
Server B gets the user request for the file
Server B uses Guzzle to set up a download stream to server A
Server B outputs the StreamedResponse directly to the user
Doing so will serve the download in real time without having to wait for the entire file to finish. However, I do not know whether you can stream to the user and store the file on disk at the same time. There's a stream_copy_to_stream function in PHP which might allow this, but I don't know that for sure.
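For what it's worth, here is a rough sketch of that flow, including one way to write a local copy while streaming (the part the note above was unsure about). The URL, paths and content type are placeholders:

<?php
// Proxy a file from server A through server B to the browser without buffering
// the whole file, while keeping a local copy for the next request.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\HttpFoundation\StreamedResponse;

$client = new Client();

// 'stream' => true returns a PSR-7 body we can read in chunks instead of a
// fully downloaded file.
$upstream = $client->get('https://storage-box.internal/files/huge.bin', ['stream' => true]);
$body     = $upstream->getBody();

$response = new StreamedResponse(function () use ($body) {
    $local = fopen('/var/cache/files/huge.bin.part', 'wb');
    while (!$body->eof()) {
        $chunk = $body->read(8192);
        fwrite($local, $chunk);   // keep a copy for the next request...
        echo $chunk;              // ...while sending it straight to the client
        flush();
    }
    fclose($local);
    rename('/var/cache/files/huge.bin.part', '/var/cache/files/huge.bin');
});

$response->headers->set('Content-Type', 'application/octet-stream');
$response->send();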

How can I mount an S3 bucket to an EC2 instance and write to it with PHP?

I'm working on a project that is being hosted on Amazon Web Services. The server setup consists of two EC2 instances, one Elastic Load Balancer and an extra Elastic Block Store on which the web application resides. The project is supposed to use S3 for storage of files that users upload. For the sake of this question, I'll call the S3 bucket static.example.com
I have tried using s3fs (https://code.google.com/p/s3fs/wiki/FuseOverAmazon), RioFS (https://github.com/skoobe/riofs) and s3ql (https://code.google.com/p/s3ql/). s3fs will mount the filesystem but won't let me write to the bucket (I asked this question on SO: How can I mount an S3 volume with proper permissions using FUSE). RioFS will mount the filesystem and will let me write to the bucket from the shell, but files that are saved using PHP don't appear in the bucket (I opened an issue with the project on GitHub). s3ql will mount the bucket, but none of the files that are already in the bucket appear in the filesystem.
These are the mount commands I used:
s3fs static.example.com -ouse_cache=/tmp,allow_other /mnt/static.example.com
riofs -o allow_other http://s3.amazonaws.com static.example.com /mnt/static.example.com
s3ql mount.s3ql s3://static.example.com /mnt/static.example.com
I've also tried using this S3 class: https://github.com/tpyo/amazon-s3-php-class/ and this FuelPHP specific S3 package: https://github.com/tomschlick/fuel-s3. I was able to get the FuelPHP package to list the available buckets and files, but saving files to the bucket failed (but did not error).
Have you ever mounted an S3 bucket on a local linux filesystem and used PHP to write a file to the bucket successfully? What tool(s) did you use? If you used one of the above mentioned tools, what version did you use?
EDIT
I have been informed that the issue I opened with RioFS on GitHub has been resolved. Although I decided to use the S3 REST API rather than attempting to mount a bucket as a volume, it seems that RioFS may be a viable option these days.
Have you ever mounted an S3 bucket on a local linux filesystem?
No. It's fun for testing, but I wouldn't let it near a production system. It's much better to use a library to communicate with S3. Here's why:
It won't hide errors. A filesystem only has a few error codes it can send you to indicate a problem. An S3 library will give you the exact error message from Amazon so you understand what's going on, log it, handle corner cases, etc.
A library will use less memory. Filesystem layers will cache lots of random stuff that you may never use again. A library puts you in control to decide what to cache and not to cache.
Expansion. If you ever need to do anything fancy (set an ACL on a file, generate a signed link, versioning, lifecycle, change durability, etc), then you'll have to dump your filesystem abstraction and use a library anyway.
Timing and retries. Some fraction of requests randomly error out and can be retried. Sometimes you may want to retry a lot, sometimes you would rather error out quickly. A filesystem doesn't give you granular control, but a library will.
The bottom line is that S3 under FUSE is a leaky abstraction. S3 doesn't have (or need) directories. Filesystems weren't built for billions of files. Their permissions models are incompatible. You are wasting a lot of the power of S3 by trying to shoehorn it into a filesystem.
Two random PHP libraries for talking to S3:
https://github.com/KnpLabs/Gaufrette
https://aws.amazon.com/sdkforphp/ - this one is useful if you expand beyond just using S3, or if you need to do any of the fancy requests mentioned above.
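For comparison with the mount-based attempts above, a minimal write with the AWS SDK for PHP (v3) looks roughly like this; the region, key and local path are placeholders:

<?php
// Upload a file to S3 with the AWS SDK for PHP instead of a FUSE mount.
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\Exception\AwsException;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
]);

try {
    $s3->putObject([
        'Bucket'     => 'static.example.com',
        'Key'        => 'uploads/avatar-123.png',
        'SourceFile' => '/tmp/avatar-123.png',
    ]);
} catch (AwsException $e) {
    // The SDK surfaces Amazon's real error message instead of a generic filesystem errno.
    error_log($e->getAwsErrorMessage());
}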
Quite often, it is advantageous to write files to the EBS volume, then force subsequent public requests for the file(s) to route through CloudFront CDN.
In that way, if the app must do any transformations to the file, it's much easier to do on the local drive & system, then force requests for the transformed files to pull from the origin via CloudFront.
e.g. if your user is uploading an image for an avatar, and the avatar image needs several iterations for size & crop, your app can create these on the local volume, but all public requests for the file will take place through a CloudFront origin-pull request. In that way, you have maximum flexibility to keep the original file (or an optimized version of the file), and any subsequent user requests can either pull an existing version from a CloudFront edge, or CloudFront will route the request back to the app and create any necessary iterations.
An elementary example of the above would be WordPress, which creates multiple sized/cropped versions of any graphic image uploaded, in addition to keeping the original (subject to file size restrictions and/or plugin transformations). CDN-capable WordPress plugins such as W3 Total Cache rewrite requests to pull through the CDN, so the app only needs to create unique first-request iterations. Adding URL versioning for browser caching (http://domain.tld/file.php?x123) further refines and leverages the CDN functionality.
If you are concerned about rapid expansion of EBS volume file size or inodes, you can automate a pruning process for seldom-requested files, or aged files.

Download a list of Objects in a bucket from S3

I have a bucket on Amazon S3 which contains hundreds of objects.
I have a web page that lists out all these objects and has a download object link in html.
This all works as expected and I can download each object individually.
How would it be possible to provide a checkbox next to each link, which allowed a group of objects to be selected and then only those objects downloaded?
So to be clear, if I chose items 1, 2, and 7 - and clicked a download link - only those objects would be downloaded. This could be a zip file or one file at a time, although I have no idea how this would work.
I am capable of coding this up, but I am struggling to think HOW it would work - so process descriptions are welcome. I could consider Python or Ruby, although the web app is PHP.
I'm afraid this is a hard problem to solve.
S3 does not allow any 'in place' manipulation of files, so you cannot zip them up into a single download. In the browser, you are stuck with downloading one URL at a time. Of course, there's nothing stopping the user queuing up downloads manually using a download manager, but there is nothing you can do to help with this.
So you are left with a server-side solution. You'll need to download the files from S3 to a server and zip them up before delivering the zip to the client. Unfortunately, depending on the number and size of the files, this will probably take some time, so you need a notification system to let the user know when their file is ready.
Also, unless your server is running on EC2, you might be paying twice for bandwidth charges. S3 to your server and then your server to the client.
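A sketch of that server-side flow with the AWS SDK for PHP and PHP's ZipArchive is below; the bucket name, keys and paths are placeholders, and for many or very large files this belongs in a background job rather than the web request:

<?php
// Pull the selected objects down from S3 and bundle them into a single zip.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

$selectedKeys = ['photos/1.jpg', 'photos/2.jpg', 'photos/7.jpg'];  // from the checkboxes
$zipPath      = '/tmp/download-' . uniqid() . '.zip';

$zip = new ZipArchive();
$zip->open($zipPath, ZipArchive::CREATE);

$tmpFiles = [];
foreach ($selectedKeys as $key) {
    $tmp = tempnam(sys_get_temp_dir(), 's3');
    $s3->getObject([
        'Bucket' => 'my-bucket',
        'Key'    => $key,
        'SaveAs' => $tmp,              // stream the object straight to disk
    ]);
    $zip->addFile($tmp, basename($key));
    $tmpFiles[] = $tmp;
}

$zip->close();                          // the temp files are read here, so delete them afterwards
array_map('unlink', $tmpFiles);

// Now notify the user (email, polling endpoint, ...) that $zipPath is ready.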

Is there a way to use the CPU of a remote machine to convert in ffmpeg?

My site is built in PHP. I have a WWW server, where all the uploads end up for processing, and then they get rsynced to one of the 4 media servers. If there is a slow and steady stream of uploads, the WWW server converts them all reasonably quickly, but if a bunch of people upload something at the same time, they queue up, and it may take several hours for a file to be processed.
The media servers are typically idle, since serving files off SSD drives results in no iowait, so the CPU is just sitting there, and I wanted to utilize it for conversions.
What would be a good (simple) way to do that?
Have the WWW server copy the files to the media server, run a continuous process there that converts them, and then inform the web server somehow.
Here's an example using a database for state communication:
Web server receives upload
Web server copies file to one of the media servers and creates a database entry with the state "NEW", and assigns it to the media server
The media server in question notices the new entry while polling the database every 10 seconds or so
The media server converts the video and updates the database entry to "PROCESSED" state
Video is now visible on the website.
The assignment could either be handled by handing the files over to the media servers in round-robin fashion, or you could even make them report their current workload so that the least busy server can be used.
(Also you have SSD storage for videos? I want some of that...)
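A minimal version of the polling worker from the list above could look like the sketch below; the table, column names, DSN and ffmpeg options are made up for illustration:

<?php
// Poll the shared database for NEW jobs assigned to this host, convert them
// with ffmpeg, then mark them PROCESSED (or FAILED).
$db   = new PDO('mysql:host=db.internal;dbname=media', 'worker', 'secret');
$host = gethostname();

while (true) {
    $stmt = $db->prepare(
        "SELECT id, path FROM videos WHERE state = 'NEW' AND assigned_to = ? LIMIT 1"
    );
    $stmt->execute([$host]);
    $job = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($job === false) {
        sleep(10);                      // nothing to do, poll again in 10 seconds
        continue;
    }

    $src = escapeshellarg($job['path']);
    $dst = escapeshellarg(preg_replace('/\.[^.]+$/', '.mp4', $job['path']));

    // Blocking conversion; exit code 0 means success.
    exec("ffmpeg -y -i $src -codec:v libx264 -codec:a aac $dst", $output, $rc);

    $db->prepare('UPDATE videos SET state = ? WHERE id = ?')
       ->execute([$rc === 0 ? 'PROCESSED' : 'FAILED', $job['id']]);
}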
Why is the webserver doing the ffmpeg conversion in the first place? It seems the media servers should be doing that anyway.
When an upload arrives, copy it to a media server, round-robin. You could detect the load on the media servers and copy to the least used one. Signal the server you copied to so that it does the ffmpeg conversion. When it's complete, the media server copies the processed file to the rsync location for the rest of the servers.
Seems fairly simple to me.
You could use ssh -e none hostname command < infile > outfile to execute the conversion process on the media server from the web server, piping the file in via stdin and getting the output file via stdout.
If your network is properly secured, you could use rsh instead, and avoid the computation overhead of the encryption, but you need to be very certain nobody else can get access to your media server, since rsh is a textbook example of a remote-execution exploit.
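From the web server, that ssh pipe could be driven from PHP roughly as sketched below; the hostname, paths and codec options are placeholders, and the fragmented-MP4 flags are there because ffmpeg writes the result to a non-seekable pipe:

<?php
// Push the raw upload to a media server over ssh, run ffmpeg there reading from
// stdin and writing to stdout, and capture the converted file locally.
$in  = escapeshellarg('/uploads/raw/video123.avi');
$out = escapeshellarg('/uploads/converted/video123.mp4');

$cmd = "ssh -e none media1.internal "
     . "'ffmpeg -i pipe:0 -codec:v libx264 -codec:a aac -movflags frag_keyframe+empty_moov -f mp4 pipe:1' "
     . "< $in > $out";

exec($cmd, $output, $rc);

if ($rc !== 0) {
    error_log("remote conversion failed with exit code $rc");
}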

Is there a way to allow users of my site to download large volumes of image files from Amazon S3 via Flash / PHP / other service?

My website allows users to upload photographs which I store on Amazon's S3. I store the original upload as well as an optimized image and a thumbnail. I want to allow users to be able to export all of their original versions when their subscription expires. So I am thinking the following problems arise
Could be a large volume of data (possibly around 10GB)
How to manage the download process - e.g. if it gets interrupted, where to resume from, and how to verify that files downloaded successfully
Should this be done with individual files or try and zip the files and download as one file or a series of smaller zipped files.
Are there any tools out there that I can use for this? I have seen Fzip, which is an ActionScript library for handling zip files. I have an EC2 instance running that handles file uploads, so I could use this for downloads also - e.g. copy the files from S3 to EC2, zip them, then download them to the user via a Flash downloader and use Fzip to uncompress the zip to the user's hard drive.
Has anyone come across a similar service / solution?
all input appreciated
thanks
I have not dealt with this problem directly but my initial thoughts are:
Flash or possibly jQuery could be leveraged for a homegrown solution, having the client send back information on what it has received and storing that information in a database log. You might also consider using BitTorrent as a mediator: your users could download a free torrent client, and you could investigate a server-side tracker (maybe RivetTracker or PHPBTTracker). I'm not sure how detailed these get, but at the very least, since you are assured you are dealing with a single user, once they become a seeder you can wipe the old file and begin on the next.
Break files larger than 2GB into 2GB chunks to accommodate users with FAT32 drives that can't handle files larger than ~4GB. Break them down to 1GB chunks if space on the server is limited, keeping track via a database record of what has been zipped from S3.
Fzip is cool but I think it's more for client side archiving. PHP has ZIP and RAR libraries (http://php.net/manual/en/book.zip.php) you can use to round up files server-side. I think any solution you find will require you to manage security on your own by keeping records in a database of who's got what and download keys. Not doing so may lead to people leeching your resources as a file delivery system.
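If the files stay on S3, one common way to hand out the per-user download keys mentioned above (not specific to Fzip, just a sketch using the AWS SDK for PHP) is a short-lived pre-signed URL, logged against the user in your database; bucket, key and region are placeholders:

<?php
// Generate a time-limited pre-signed URL for one object so a logged-in user can
// download it directly from S3 without the bucket being public.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

$cmd = $s3->getCommand('GetObject', [
    'Bucket' => 'user-originals',
    'Key'    => 'user42/photo-0001.jpg',
]);

// The link stops working after 20 minutes; record who requested it in your own database.
$request = $s3->createPresignedRequest($cmd, '+20 minutes');
echo (string) $request->getUri();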
Good luck!
